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ABSTRACT 

Dynamic systems have _ to enormous size and complexity. The 
ability to simulate the systems has greatly helped in the design and op- 
eration of these systems. Inspite of advancements in model simplification 
techniques and refinements in solution procedures, many of todays systems 
are prohibitively expensive to simulate with the required accuracy. It 
Was felt that the equations describing the system could be solved in paral- 
lel with savings in computer time. To this end the solution process was 
studied to find large scale (macro) parallelism. Macroparallelism was 
found both in the equations of the model and in the solution processes. 
Once this parallelism was found, requirements for and gains achievable by 
multiprocessors which use this inherent parallelism were developed. 

An efficient set of model equations is obtained by converting the 
differential equations to algebraic equations by the implicit multi-step 
integration fomulas. The entire set of equations is then solved by an 
iterative algorithm. The Chaotic Relaxation and Newton-SOR Algorithms 
exhibit 2 high degree of parallelism which can be increased by ordering 
the equations to the near block diagonal form. This ordering is possible 
because of the sparsity present in models of large systems. The Gauss- 
Seidel and true Newton Algorithms are not obviously executable in parallel, 
but by ordering the equations to a bordered block diagonal form parallelism 
is exposed. 

The requirements for sharing data are dictated by the form of the 
equations, while the control requirements depend on the algorithms. The 
near block dieggonal form algorithms exchange solution data through shared 


memory, and it is the contention over this shared memory which limits the 





gains achievable by parallel execution. Simple high order multiproces- 
sors can efficiently ee the Chaotic Relaxation and Newton-SOR Al- 
gorithms. The bordered block diagonal form algorithms require that 

the processors solving the diagonal blocks alternate active solution 
periods with the processor solving the border block (cut-set processor). 
Mme cut-set Peeenee exchaneee all required information while the other 
processors are idle. The gains achievable are limited by the time re- 
quired for the cut-set processor to solve the equations. Simple high 
order multiprocesSors are developed to efficiently execute the Gauss- 
Seidel and Newton Algorithms, but the gains achievable are dependent on 
the extent to which the equations decompose to the bordered block diagonal 
forms. 

The algorithms were programmed to determime the control and data 
sharing requirements. From these requirements the delays from parallel 
solution were estimated. The convergence rates of the algorithms were 
not altered by parallel solution, so that a rough ranking of the paral- 


lel solution techniques was made. 
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CHAPTER I 


INTRODUCT ION 


Dynamic simulation is the prediction of the way in which a system 
will change with time. The system can be as simple as a resistor/capacitor 
filter or as complex as a modern oil refinery. For the RC filter, dynamic 
simulation would predict the voltage levels within the filter for a future 
time, based on the known starting voltages and input voltages. For an oil 
refinery, dynamic simulation might be used to predict what change in the 
quantity of a specific output would result from a change in the inputs, 
again based on the initial conditions and the other inputs. 

The equations describing the system, called the model, are developed 
from the mathematical relationships within the system. Dynamic simulation 
is the solution of these equations which predicts the response of the 
actual system under the same conditions. 

The development of the computer has enhanced the ability to solve the equa- 
tions, allowing the system which can be simulated tc grow in size and com- 
plexity. Without the computer to solve the equations, it would be nearly 
impossible to predict the response of large systems except by building and 
testing the actual system. The large complex systems of today can be de- 
signed and built, because the computer allows the actual system response 
to be predicted from the mathematical models of the system. 

As the size and complexity of the models has grown, the cost of simu- 
lation has increased. The cost is directly related to the time required to 


solve the equations defining the system and the size of the equations. There 





are uses of dynamic simulation where the time required to solve the problem 
is critical for reasons other than cost. An example is the use of dynamic 
simulation to provide control signals for the system. In this case, the 
changes in the system must be predicted in time to allow the appropriate 
controls to be instituted. If the solution cannot be found in real time, 
it cannot be used to prevent undesirable changes in system variables. 

This thesis investigates the possibility of using a multiprocessor to 
reduce the time required to solve the dynamic simulation problem. The multi- 
processor is basically many independent computers which can efficiently ex- 
change control and solution information. The investigation begins by de- 
veloping the parallelism of the algorithms used to solve the dynamic simu- 
lation provdlenm and the parallelism ftnferent in the structure of tié équations 
in the model. It then develops the computational requirements of the parallel 
algorithms and finally proposes and evaluates multiprocessor structures to 
execute the algorithms. 

The goal of the use of multiprocessor structures is to achieve a solution 
in Win. the time required for a single processor, where n is the number of 
computing elements in the multiprocessor. The restrictions on the size of 
mn depend on the method of solution and the structure of the multiprocessor. 
It is shown that because of the small amount of information exchange needed 
for solution of the dynamic simulation problem, a large number of processors 


can be used efficiently. 


I.A4. Dynamic Simulation Problem 


Simulation techniques were developed before the computer and involve 





more than the analysis of a ayaa system. The first technique to be de- 
veloped was iconic simulation, the use of scale models to study a system. 

The most recent development is discrete event simulation using random number 
generators to model probabilistic events. This thesis studies a third type - 
dynamic simulation - which is among the oldest types of simulation. [1] 

The more recent discrete event simulation is used within the study of dynamic 
simulation. 

Dynamic simulation consists of establishing the mathematical relation- 
ships between elements of the system, then solving the equations to predict 
the changes which will occur in the system. This thesis only studies methods 
of solving the equations defining the system. The development of the equa- 
tions which define large systems usually is not complicated. Once the mathe- 
matical relationships for the elementary parts of the system are known, 
their equations can easily be combined to develop the model describing the 
system. 

Normally, the equations defining elementary systems can be solved ana- 
lytically. But when complex systems are modelled, analytical solutions are 
available only for simplified linear representations. The first method de- 
veloped for solving these equations was the use of numerical methods to 
approximate the solution. Originally, it took teams of mathematicians long 
periods of time to solve the models of what would be considered a simple 
system today. In 1922, the differential analyzer was invented. It allowed 
mechanical devices which integrate differential equations to be built. Scon 
thereafter the electronic analog computer replaced the mechanical version. 
The electronic analog computer allowed the solution of much larger and more 


complex systems models. The analog computer lacked the accuracy required 





for solution of models over a long period of time. Further, it may require 
a difficult and time consuming process to program the model on the analog 
computer. When the first digital computer was built, one of the initial 
applications was dynamic simulation. As the speed of the digital computer 
improved, hybrid computers, a combination of digital and analog computers, 
were designed and built. The hybrid computer combined the speed of solution 
possible on analog computers, with the accuracy available in digital computers, 
to simulate complex models over long periods of time. As the speed of the 
digital computer improves it is becoming capable of solving the high fre- 
quency component equations previously only economically solved by analog 
computer. 

Although the digital computer has many advantages over the analog 
computer, it is expected that analog computation will remain. Just as the 
digital computer has been improved so has the analog computer. The analog 
computer started cut a higher level of scphistication so further improve- 
ments were more difficult and little research has been done of the ana- 
log computer. Digital simulation has been studied to find parallel solution 
methods, a feature analog computation has always used. 

Dynamic simulation has become an extremely helpful tool in all phases 
of system studies, development, operation, and experimentation. The system 
designer can use dynamic simulation to insure the satisfactory completion 
of the system. Simulation proves the feasibility of a proposed system. The 
specifications for the parts of the system are developed by simulating the 
Operation of the system. Simulation studies help develop operating procedures 
for the system by testing its response to numerous possible conditions called 


contingency planning. Through the use of dynamic simulation, a system can 





be designed and developed with considerable confidence. 

Dynamic simulation is also useful for the operation of a system. For 
some systems the only means of providing control information is through 
simulation. For control purposes the speed with which the dynamic simu- 
lation problem can be solved is critical. The response of the system must 
be predicted in time for control signals to be instituted which prevent un- 
wanted changes in the output. 

An example of this use of dynamic simulation for control is the 
newest gun-fire control system used by the Navy. This has typically been 
an exclusive analog application. However, the speed and maneuverability of 
modern aircraft and the need to compute the solution for multiple targets 
has resulted in the use of a digital computer. 

Scientists afso ase dynamic simatation. It provides ad means for 
scientists to perform experiments which are too costly to develop on the 
real system, or which might result in grave damage to the system if actually 
performed. Simulation can also provide information on parts of the system 
where measurements cannot be made. 

For all these and other uses, dynamic simulation provides more infor- 
mation at less cost than other possible approaches. Without the ability to 
solve the dynamic simulation problem many modern achievements would not be 
possible. 

The electric power industry is one of the largest users of dynamic 
simulation. Because of the high fixed costs and high reliability required 
in this industry, it has led the search for improvements in modeling. 
The power industry must be positive that the changes it makes in any system 


will not have any adverse effects on the system's operation. This is 





accomplished by simulating the effect of the proposed changes. 

A model for a power system consists of the interconnection of smaller 
models representing the generating stations and loads, and the network 
interconnecting these parts. The model of a generating station may contain 
from two to thirty differential equations depending on the degree of com- 
plexity required for the simulation. The equations describe the actions 
of the various parts of the station as a result of changes in the power re- 
quirements of the network at the location of the station. As the complexity 
of the model is increased (usually to allow a longer period of time to be 
accurately simulated),more parts are included in the model. For example, 
at the lowest level, a model for a generating system might be only a simple 
representation of the generator. If more complexity is desired, the generator 
model would he expanded and the voltage regulator added. Then the turbine 
and governor, and finally the boiler would be included to give a more com- 
plex model. 

The vastly different uses of electric power result in a wide variety 
in the models which represent these loads. The simplest representation is 
by a constant power requirement at each node. Other simple representations 
are a constant current requirement or constant impedance. When the power 
requirements of load change with time, more complex models are needed. Then 
differential equations, similar to the generator station models, are required. 
An example of such a load is a large induction motor. 

The network supplying the electrical power to the loads from the gener- 
ators is represented by algebraic equations. Points where a generator or 2 
load is connected to the network or where several power lines of the network 


are ccnnected is known as a bus. Each bus requires a complex algebraic 
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equati»1 to represent its effect on the time scale of interest. Normally 
a bus has only a few power lines and/or a generator or load connected to 
it, and only the variables for the connected parts occur in the equation 
for the bus. Even though there may be hundreds of busses in the model, 
the equation for each bus seldom depends on more than ten variables, and 
Often fewer. The busses are all interconnected but one bus never 
connects to all of the other busses. 

The models of the different parts of the power system are easily com- 
bined to form the complete model. The generator station equations are 
made to depend on the voltage of the bus which connects the generator to 
the remainder of the system. Similarly, the nonconstant load models de- 
pend on the voltage of their connecting bus. If a bus is to be added, 
then the algebraic equation is included and its variable added to all bus 
equations which have power lines connecting the existing busses to the new 
bus. The addition of a power line requires adding new coefficients to 
the busses it connects. 

When a model must represent the power system of a large utility company, 
which may cover a multi-state region, the number of equations in the model 
becomes quite large. It is not difficult to combine the individual parts 
to form these large models. Normally, as the parts are more distant from 
the object to be studied, the models are simplified. Since the generator 
and load models only depend on the bus to which they are connected, and 
any one bus only connects to a few other busses, the equations of the model 
exhibit a property known as sparsity. Sparsity is the occurrence of only 


a small number of the possible variables in each equation and occurs fre- 


quently in large sets of equations. For power systems, sparsity results 





: 


from the small number of busses connected to any one bus. 

For the power system the variables represent current, voltage, power, 
and control signals. Nonlinearities occur throughout the system primarily 
as a result of the representation of limits and saturation. A total model 
may consist of thousands of nonlinear differential and algebraic equations. 

Any large dynamic simulation problem exhibits many of the properties 
described for the power systems. The mathematical relationships are de- 
fined by large sets of algebraic and differential (or difference) equations. 
Quite often these equations are simplified to the linear form to ease so- 
lution,but the nonlinearities sometimes have to be included in the model 
to obtain accurate results. Sparsity also exists in most large dynamic 
simulation problems. 

This thesis a@ssuiiés the following f0rii 6f the differentidlt afidy digé- 


braic equations for the dynamic simulation problem: 


E Sen (Ce) yy (e) u(t) 5st) 
| al 


0 = G(x(t),y(t),u(t),t) 


Where x is the state vector, y the algebraic vector, u is the vector of 
inputs, and t is time. 

This research uses models of electric power systems wherever examples 
are required. These models are typical of all dynamic simulation problems, 
and the techniques proposed by this thesis can be used to reduce the time 


required to solve the equations of any dynamic simulation problem. 





E.B Computer Solution of the Dynamic Simulation Problem 


One of the first uses and still one of the largest uses of the com- 
puter is the simulation of dynamic systems. Certainly without the computer, 
simulation would not have progressed to itg current widespread use. How- 
ever, the systems analyst has been able to increase the size and complexity 
of the models faster than the computer has progressed. The time required 
by currently available computers restricts the size of the model to be 
simulated and the availability of the solution. Because of the ease of 
developing large models and the degree of interconnection of power systems, 
there is a need to solve larger dynamic simulation problems at a faster rate. 

Many people have tried to reduce the time required for solution of the 
dynamic simulation problem. Some, Gear [4], Davison [5], Dommel and Sato 
[6], and Wu [7] have approached the probDreéi by developiig tiew algorftinis 
for approximating the solution. Others,Davison [8], Chidambara [9], Undrill 
and Turner (Midd Anderson [11], and Van Ness [12] have developed methods 
of reducing the size of the model without losing vital’ information. Neither 
group has achieved sufficient success to relieve the modeler's concern over 
the costs of performing dynamic simulation. This leaves a vast demand for 
the ability to simulate larger systems at more reasonable speeds. 

As early as 1959, Gauss [14] realized that for a computer to solve a 
problem most efficiently, the structure of the computer must be based on 
the requirements of the problem. Lehman [15] and Rosenfeld [16] suggested 
that a problem must be examined thoroughly to discover the inherent parallel- 
ism which can be used effectively, rather than to try to find parallelism 
at the instruction level. Korn [17] first suggested the dynamic simulation 


problems suitability for parallel processing while Lehman [15] & Rosenfeld [16] 
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have examined the solution of systems of algebraic equations. None of them 
developed the ideas far enough to demonstrate the actual gains or the re- 
sulting computer structure, nor did they investigate the problems which 
might occur. Much more recently, Wu [7] has again suggested the multiproc- 
essor structure for the solution of power system problems. Wu's efforts, 
however, are mainly in structuring the ''diakoptics" type algorithms, which, 
unfortunately are applicable only to linear systems. 

There have also been efforts to develop parallel numerical methods to 
solve differential equations. Miranker and Liniger [18]. structured pre- 
dictor-corrector type integration methods so that the prediction and correc- 
tion equations could be executed on different processors. Nievergelt [19] 
proposed using a large parallel processor to integrate a single differential 
@qtetion by using different processors for dtfferent time tntervals. These 
methods found a much lower level of parallelism than examined by this thesis. 

This thesis will examine the dynamic simulation problem to determine 
the major parallel paths available during execution which may be exploited 
to increase the speed of execution. Computing structures will be proposed 
which best suit the inherent parallelism. Execution speeds for these struc- 
tures will be estimated. The underlying hypothesis is that the computing 
structure which most resembles the structure of the dynamic simulation prob- 


lem will provide the most efficient execution of the model. 


I.C Proposed Computer Development 


Predominant computer technology dictates that problems must be solved 
in a serial fashion. Yet Gonzalez and Ramamoorthy [23], Baer and 


Russell fe? | and others have shown that many problems can be solved in 


' 
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parallel and have developed a graph theoretic methods of demonstrating 

the parallelism inherent in a problem or computer program. The numer- 
ical methods used to solve the dynamic simulation problem are readily ana- 
lyzed by these techniques and may be shown to exhibit a very high degree of 
parallelism. 

Not only has parallelism been found in computer programs but also advanced 
parallel computer designs have been formed and built. ILLIAC IV [24, 25, and 26] 
a parallel processor, was built and many other designs proposed (some were 
built). ILLIAC IV, in the terminology developed by Flynn [27], is a single 
instruction multiple data (SIMD) computing structure. That is, all proc- 
essors execute the same instruction, but on different parcels of data. 
Another category proposed by Flynn is a multiple instruction multiple data 
(MIMD) computing structure. C.mmp [28] is an MIMS computer developed at 
Carnegie Mellon University specifically to study artificial intelligence 
problems. For equivalent processor features, the MIMD structure is more ex- 
pensive but has greater capabilities than the SIMD structure. This thesis 
attempts to extract the successes and avoid the failures of these studies, 
in order to reduce the solution time of the dynamic simulation problem. 

The set of equations representing a dynamic system consists of a 
large number of simultaneous nonlinear differential (or difference) and 
algebraic equations. Each equation defines a specific variable. The set 
of equations exhibit a high degree of sparseness and in general includes 
nonlinear terms scattered throughout. 

To reduce the time required to solve the large sets of equations of the 
dynamic simulation problem, methods are developed to solve many of the equa- 


tions concurrently. Since each equation requires different operations to be 
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performed on different data and there is no correlation between either the 
operations to be performed or the data to be used among the equations 
Solving the equations concurrently requires independent control of the 
processing elements both for the operations to be performed and for the 
data to be used. In Flynn's terminology, the computing structure has to 

be of the MIMD form. Loosely linked processors each capable of executing 
its own independent computer program are required. The other computing 
structures fail in some respect to be able to solve completely concurrent ly 
the equations of the dynamic simulation problem. 

The major problem in all parallel computation is the sharing of infor- 
mation between the parallel processes. This sharing can delay the solution 
process in two ways. A processor may have to idle until the information 
it requires is computed by another processor. Or a processor may be delayed 
by the physical restriction that only one processor can use a single resource 
of the multiprocessor. Most often this single resource is shared memory and 
the delay is known as memory contention. 

This thesis proposes methods to minimize the delays normally resulting 
from multiprocessor solution by reducing the information which must be shared. 
The primary method used to reduce information sharing is to provide each 
processor with a private copy of all information available at the start of 
the solution process. This information is stored in a local memory which 
only one processor can access,to insure no delays will be encountered in 
accessing this data. (Rosenfeld [16] mentioned the use of local memories, 
but did not analyze their effects.) 

The other method of reducing the amount of information which must be 


shared depends on the properties of the equations of the dynamic simulation 
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problem. By properly grouping the equations into blocks, the amount of 
information which must be shared between blocks can be minimized. 

Computing structures are proposed and analyzed to determine the delays 
which will result from exchanging data. From this analysis it is possible 
to estimate the point where, with the addition of more processors, the re- 
sulting increase in the delays from sharing data will negate the gains in 
execution speed. 

This analysis requires actually programming the numerical methods 
used to solve the dynamic simulation problem. From the programs, accurate 
estimates of the amoung of shared information and the time available to 
exchange the information can be obtained. Time is measured in memory cycles 
to make the analysis independent of the actual processor used. The delays 
are based on the ratio cf the number of wiéiory accéssés requiritig shared 
data to the total number of memory references. The actual decrease in so- 
lution time is based on the number of memory references in the parallel path 
plus the delays encountered,compared to the number of memory references a 
single processor would require. The numerical methods are presented in 
Chapter Two and analyzed in Chapter Three. 

Chapter Four proposes multiprocessor structures which can exploit the 
parallelism found in the dynamic Sianilat wen problem. The structures are 
analyzed to determine the delays encountered and the time required for 
solving the dynamic simulation problem. 

Chapter Five discusses further methods of reducing the sharing of data 
and decreasing the time required for solution of the dynamic simulation 


problem. 





CHAPTER ITI 


NUMERICAL SOLUTION METHODS 


The dynamic system model under consideration consists of a large set 
of simultaneous equations of two types (see equation I.A.1). The first 
type is a first order differential equation, which may be nonlinear. With 
the differential equations, initial conditions must be specified. The 
second type is an algebraic equation, which also may be nonlinear. Because 
of the nonlinearities, direct solution is normally not possible, and iterative 
methods are required to find an approximate solution. 

There are three methods of solving a set of algebraic and differential 
equations. Either the algebraic equations can be eliminated and the resulting 
equations solved, or the equations can be solved separately and the solutions 
matched for consistency, or the difference equations which approximate the 
solution of the differential equations may be combined with the algebraic 
equations and the entire set can be treated as algebraic equations ré]. 

Elimination of the algebraic equations is a long and difficult process. 
The resulting differential equations are nonsparse and require many more 
operations to solve than the original equations. When the equations are 
solved separately, the sparsity remains and the solution process is efficient 
unless the separate solutions are inconsistent. If this happens, the time re- 
quired to find this solution has been wasted, and a new attempt must be made 
to find a consistent solution. Experience indicates that inconsistent solu- 
tions frequently occur in practice. 

Recently Dommel and Sato [6] showed that the implicit integration 
schemes have many advantages. These schemes require an iteration process 


to find the next estimate; this allows the difference equations to be solved 
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concurrently with the algebraic equations. The concurrent iteration process 
insures that the solution found is consistent. The sparsity of the equations 
is maintained to include the block parallelism. Finally the implicit inte- 
gration schemes allow the step size to be increased beyond that possible 

for the other methods. Since the solution of the entire set of equations 

in this format is no more difficult than wouid be required to solve the 
algebraic equations, the solution of the dynamic simulation problem by ex- 
pressing the differential equations as algebraic equations through the use 


of the implicit integration schemes has become the accepted solution method. 


IL.A Numerical Integration of Differential Equations 


Numerical integration methods are a key factor in the solution process 
of the dynamic simulation problem. Numerical methods were first used before 
mechanical integration machines were developed. Now, even though the actual 
solution algorithms are those for algebraic equations, the numerical inte- 
gration formulas are needed to express the differential equations as alge-~ 
braic equations. In this section the integration methods are presented, 
then in the next section the methods of solving the resulting set of alge- 
braic equations are presented. 

The simplest method of approximating the solution of a differential 
equation is Euler's Method. This method begins with an initial condition 
and estimates a new solution point a small time step away along the direction 
of the initial derivative. At the new estimate, this method evaluates the 
new derivative and again moves a small time step in the direction of the 
derivative for the next estimate. By repeating this procedure a series of 


points is calculated which approximates the solution to the differential 
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equation. The series is calculated successively from the following formula: 


weer) = x(t + Wee), y(t), uct), t) Tre. 1 


where h is the time step. The accuracy of Euler's Method over many time 
steps, depends proportionately (linearly) on the size of the time step. For 
almost all uses this high degree of error is unacceptable. 

To increase the accuracy of the approximation the amount of information 
on which the approximation is based must be increased. Methods with im- 
proved accuracy either use more points as a basis for the new estimate, or 
use more calculations of the derivative, or both. When the methods provide 
direct solution of the new estimate they are said to be explicit methods. 
For increased accuracy, the new estimate is included in the formulas and 
the solution is obtained by iteration. These methods are known as implicit. 

The Runge-Kutta methods use several evaluations of the derivative over 
a single step. These methods are in general the most accurate of the explicit 
methods and are simple to apply. They have a disadvantage in that they re- 
quire that the step size be much smaller than the time constant of the 
highest frequency component of the solution. Quite often these very high 
frequencies are of negligible magnitude and could be reasonably ignored. 
However, the Runge-Kutta Methods require the accurate approximation of the 
high frequency component to be accurate. For this reason the Runge-Kutta 
Methods will not be considered in this thesis. (For a complete discussion 
of Runge-Kutta methods see [4], [29], [36]. ) 

The multi-step methods use the estimate of the solution for severai 


time steps as a basis to predict the next estimate. The explicit: multi-step 
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methods are less accurate than the explicit Runge-Kutta schemes. With 

the multi-step methods the next estimate is easily added as one of the 
points of the basis. The addition of this point greatly increases the 
accuracy of the solution. An added benefit is the ability to increase the 
step size beyond the time constant of the highest frequency component of 
the solution without losing accuracy. An explanation of the differences 


between the explicit and implicit methods is given in [30] as: 


--- an intuitive example can be given by using the first order, 
K = 1, integration method and the differential equation 
X' = -AX AZO. 

In the explicit case, the formula becomes: 


' = o 
n n-l + hx n-l (1 bh )X oy 


: ‘ ‘ “At . 
Since the exact solution X(t) = Ce , A> 0 is a positive 


decreasing function we know that 0 < XT < = or 0 < 1l-Ah < l. 
Therefore h < l/h. 

In the implicit formulation of the same example the inte- 
gration formula becomes: 


acl 


= Lah 


Sl 
n n-l n 


Since X is a positive decreasing function 1+hd > 1 which 
ms gttue ror all h> QO. 

As one can see in this example, in the explicit case the 
step size h is restricted by the size of } while no such obvious 


restriction is indicated in the implicit use. 


The simplest multi-step implicit method is the trapezoid method. 
This method does not provide an analytic estimate of the new solution. 


Tt uses the average of the present and new derivative to step to the new 
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point. Since the new derivative cannot be found without knowledge of the 
new point the estimate of the new point is obtained by iteration. The 


formula for the trapezoid method is: 


x (tth) = x(t) +/2 [rix(c), y(t), ult), t) + Fle +h), 


y (tth), u(tth),t+h)} Il.A.2 


which is an algebraic equation and can be solved by the same techniques 
used to solve the algebraic equations in the model. These methods are 
presented in the next section. 

An easy ,accurate method of starting the solution procedure is to use 
the current position as the first estimate of the new point. With this 
estimate, the first iteration reduces to Euler's Method. When the trape- 
zOid method is solved for many steps the error in the estimate of the solu- 
tion will be on the order of the square of the step size. 

Higher order implicit methods also exhibit the ability to use larger 
step sizes. To some extent, the accuracy of the solution also increases 
with the order of the method,thus allowing larger step sizes for a desired 
solution accuracy. Associated with the use of additional points for pre- 
dicting the new estimate in the higher order methods is an increase in the 
computation required for predicting this estimate. Both the accuracy re- 
quired for the solution and the time available to find that solution must 
be used to establish the step size and order of the method to be used. The 
results of this thesis are based on the trapezoid method, but no assumptions 
are used which would prevent the application of higher order methods. 


(More detailed discussion of the multi-step methods can be found jin [4], 
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egy, 6313.) 

All integration schemes discussed herein remain valid when a set of 
differential equations is to be solved. The only requirement is that all 
equations are integrated Over the same time step together. With the im- 
plicit integration methods, the equation for each variable must be iterated 
until all variables have converged. Apparent convergence of one variable 
is not sufficient, and in fact, that variable may change as the other 
equations are iterated further. Normally, as the number of equations grows, 


so will the number of iterations required for convergence. 


EL.B Numerical Solution of Algebraic Equations 


By use of the implicit integration formulas the equations of the dy- 
namic simulation problem have all been converted to algebraic equations. 
Some of the equations resulting from differential equations and many of 
the original algebraic equations are nonlinear. This normally prevents 
the direct solution of the equations, requiring that the solution be found 
by iteration. This thesis considers two types of iteration methods, the 
linear methods and the Newton methods. The linear iterative methods use 
the fixed point form of the algebraic equations to calculate the new esti- 
mate, and have the property that the present error may be bounded by a con- 
stant factor times the previous estimate'’s error. The Newton methods re- 
quire the evaluation of the function and its derivative, to forma set of 
linear equations whose solution is the new estimate. The solution of these 
equations cause the error of the estimate to be reduced at a quadratic rate. 


In general all iterative methods can be expressed as: 
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k+l k 
ee ty) 
where if y = O(y ) Eeome Oe = G(y ) TT Boe ll 


To simplify the discussion, the remainder of this section will discuss 
the general algebraic equations C = G(y) or more generally v = G(y). 

To develop the linear iterative methods, assume also that the equations 
are linear, v = Gy. Gis then expressed as the matrix sum G = D-L-U where 
D is the elements of the diagonal of G and L and U the elements of the 
lower and upper triangular parts of G respectively. The general iterative 
formula is now 


A = Gi. + i) x + Mie. 2 


The three linear methods to be presented can be expressed in terms of the de- 
compositionof thematrix differonly in the location in the algorithm where 

, Kil». : ; 
the new estimate, y ~, is used in the calculations of the other new esti- 
mates. The Jacobi method solves for all of the new estimates based on the 
last value and then updates the variables to the new estimate. (The cal- 
culations are all based on the same point for each variable.) The Jacobi 


algorithm appears as: 


- -1 
- =p @+ vy" eee ay, 
or 
gon a - 4. py Tiepes 


ys 


The Gauss-Seidel method uses the new estimates of the variables in 
the calculations as soon as the new estimate is found. This frees the 
storage required to save both values of the variables. The Gauss-Seidel 


method appears as: 


+ = — 
y* 1 = (D - L) . oy =] Ly 
or 
k+l k - 
yo = Cy + (0-1) ty II.B.4 


The third method, Chaotic Relaxation, was proposed by Rosenfeld fi6], 
and studied by Chazan and Miranker [31], for the solution of algebraic 
equations by a multiprocessor. Like the Gauss-Seidel method, it updates 
the values of the variables as soon as the new estimate is calculated. 
Because of the use of a multiprocessor for the calculations, the order with 
which the variables is updated is not constant. The decomposition is dif- 
ferent for each iteration. This prevents expressing Chaotic Relaxation in 
the terms of equations I1I.B.3 and 4. 

The development of the linear iterative methods for linear equations 
does not restrict their application to only the linear equations. For non- 

: : ; ; : : : neh 
linear equations, each equation is solved for its diagonal variable (i 
: : ely) ; 
variable in the i- equation). 
k+l 4 k ; : 
y. =% (y.) for j#i II.B.5 
7 eg 
The:rate of convergence of the linear iterative methods can be im- 


proved by the use of an acceleration factor, a. The acceleration factor 
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modifies the update either by reducing or increasing the magnitude of the 
correction applied to the variable. The use of an acceleration factor 
maybe thought of as altering the eigenvalues of the iteration matrix in 
order to increase the rate of convergence. With acceleration the linear 
iterative methods appear as: 


k+1 k k 
ee ce(y ) + (1 -a)y eons 7: II.B.6 


For a < 1 this is known as under relaxation, and for a> 1, over 
relaxation. 

The other iterative methods considered by this thesis are the Newton 
Methods. The Newton algorithms achieve a quadratic convergence rate by 
Protecting along the slope of the eauations at the current. estimate to 
the axis. The intersection gives the new estimate. The slope of the equation is 
computed from the Jacobian, J = 3G(y) / dy. The current value of the equa- 
tions is also required, G(y). The new estimate is: 


k+1 k =| k k 
Ms sa - J Gly ) = O(y) II.B.7 


The evaluation of the equations, G(y), requires approximately the same 
amount of time as one complete linear iteration. By interspersing the 
evaluation of the Jacobian with the evaluation of the functions, the 
Jacobian can usually be calculated with only a few additional operations. 
However, the solution of the resulting equations, J5 = G(y), requires on 
the order of we operations. The increase in the rate of convergence 


usually more than compensates for the increase in the number of operations. 
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A combination of the Newton and linear methods can be used to reduce 
the total number of operations required for solution of the Newton update, 
§. The equations and the Jacobian are evaluated as above. The Jacobian is 
partitioned and the diagonal blocks solved. The off diagonal values are then 
compensated for by iterating the diagonal solution. If iterated until con- 
vergence, then a true Newton update is found. With a set number of iter- 
ations of the Newton update, the convergence rate is less than quadratic. 


This method, known as Newton-SOR, appears as: 


m+] 


~1 k 
J, (sy) +5, 8°) II.B.8 


k+1 k * 
y —ae ie SB 


Where Jy is the diagonal blocks of the Jacobian and J, the off diagonal 


ats 


blocks. §° tse unrque Solution or IT.B.8. 

The Newton-SOR method saves operations because the solution of the 
diagonal blocks requires > sin} operations rather than the va operations. 
(s is the number of shared variables required by that block, n, the number 
of variables in the block, and n the total number of variables). When the 
equations are sparse, the number of operations required to solve for the 
Newton update can be reduced. 

The solution time of the linear methods is reduced by accelerating the 
convergence rate. For the Newton methods the rate of convergence is much 
higher and further improvement is not needed. However, several methods of 
reducing the number of operations required to solve for the Newton update 


have been developed. The first method is LU Factorization. After LU 
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Factorization was developed, it was found that further Savings are possible 
by properly ordering the equations. Another improvement was obtained by 
sparsity programming. 

LU Factorization is an efficient method of solving a set of linear 
equations. The matrix of coefficients is decomposed by elementary trans- 
formations into a lower triangular and upper triangular matrix. The de- 
composition results in two trivial sets of equations which can be solved 
by substitution. If J& = G then LU = J so Lz = G and US = z. This method 
of solution requires the minimum number of operations for a general set of 
equations. [59] 

When the equations are sparse it was found that by arranging the equa- 
tions in the LU Factorizations in the proper order, the number of coefficients 
wifct were filled in (chatigéd from zero to fidtizéero valué) EouId Be reduced. 
The process of permuting the order of the equations to achieve a sparse LU 
Factorization is known as optimal ordering. 

The most widely used optimal ordering expresses the equations by a 
connection graph. The equations defining a variable become nodes with 
branches to all other nodes (variables) which occur in the equation. An 
arbitrary node of lowest degree is chosen as the first node to be elimin- 
ated. The paths which would be removed by the elimination of that node are 
replaced by new branches between the remaining nodes, and the first vari- 
able is deleted. Again the node of minimum degree is chosen for elimination, 
the new branches added, and the node deleted. This process is repeated 
until all nodes are eliminated. The order with which the nodes are elimin- 
ated becomes the order the equations should be included in the LU Factori- 


zation. This ordering does not produce the trve optimal ordering, but the 
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increased computation required to find a closer estimate of the true opti- 
mal ordering is not normally regained in the solution of the equations in 
true optimal order. [33] 

When the number of zero coefficients is very large, many additional 
Operations can be saved by sparsity programming. These techniques omit 
the operations whose results are known to be zero before the operations 
begin. 

Through the use of optimal ordering and sparsity programming, the 
number of operations required to solve for the Newton update can be made 
proportional to n for large, very sparse sets of equations. 

Generally, the linear iterations require a much smaller amount of 
computation than the Newton methods, but also have a much slower conver- 
Gerce rate. AIl three of the linear methods require approximately the 
same amount of computation, and the Jacobi method requires a slight in- 
crease in the amount of memory required, but not nearly as much more memory 
as the Newton methods require. When the methods are arranged into parallel 
form other advantages are displayed. To better be able to choose the algo- 
rithms which are used to solve the dynamic simulation problem, the conver- 
gence properties of the methods must be examined. 

The convergence properties of the algorithms can be divided into two 
categories, the region of convergence, and, given convergence, the rate 
with which the error of the estimate is reduced. The analysis possible on 
the convergence properties and on the number of operations required to com- 
pute a new estimate does not conclusively support one method. Only because 
of the vast reductions in the number of operations required to compute the 


Newton update and the experience with the convergence properties of the 
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Newton method, has it become the accepted solution method for power 
systems solution. Even though the analysis can be extended for the 
multiprocessor solution, only experience will establish the superior 
method. 

For the convergence of the linear iterative methods, the equations 
will be restricted to being irreducibly diagonally dominant. That is, the 
coefficient of the gah variable in the gh equation is larger in absolute 
value than the sum of the absolute values of the seine: coefficients of that 
equation. Because of the structure of the network equations of the power 
systems, diagonal dominance normally occurs. Proper selection of the time 
constant insures diagonal dominance for the differential equations. It is 
well known that the linear methods do not converge for some of the power sys- 
tem models. This is because the diagonal dominance has been destroyed in these 
models by the existence of large capacitor banks and/or other devices which 
cause relatively large off diagonal coefficients. 

A major difficulty in comparing the rate of convergence of these so- 
lution processes is that some methods may converge while other methods 
diverge for the same set of equations. Diagonal dominance is a normal 
property of the network equations, which, if present, allows the convergence 
of the methods to be proven, and the rates of convergence compared. 

For linear diagonally dominant sets of equations the true Newton 
method is equivalent to direct solution, and the Newton-SOR method can be 
shown to converge. In this case the direct solution of the Newton method 
produces the highest convergence rate, followed by the Newton-SOR method. 
The Gauss-Seidel method has the highest convergence rate of the linear 
methods studied followed by Chaotic Relaxation and then the Jacobi method. 


(Actually the Chaotic Relaxation method convergence rate requires all 
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equations to be updated equally.) 

The notion of diagonal dominance can be extended to nonlinear equa- 
tions, called M-functions, and similar results obtained. For nonlinear 
equations, the acceleration must be limited to under relaxation to prove 
convergence. But it is expected that by monitoring the convergence prog- 
ress, over relaxation could normally be used. 

For the Newton methods, the equations must exhibit certain continuity 
and invertability conditions in order to prove convergence. Diagonal 
dominance is not necessary and if singularities exist may not even be 
sufficient for the convergence of the Newton methods. The Newton-SOR 
method does require conditions comparable to block diagonal dominance for 
the convergence of the SOR iterations. 

The remaining part of this section shows the theoretical justification 
for these claims. The proofs are omitted if the results are unchanged from 
the source. The following definitions are required to prove the desired 
results. 

The convergence properties are established on a compact subspace 
Dre R . with the following partial ordering, x S$ y implies x. s Ys for 
amc... ., and x < y implies x = y and x. = y; for at least one 
Me 2 yeeey Ne 

A linear mapping A:R' > R” is denoted by Ac’(R') and may be represented 
by a matrix. A nonlinear mapping A is denoted by A:DC R? + R'. The map - 
pings may also be ordered, where for A, B:D©& R 3 Ro Agee implies 
Ax = Bx ¥xeD-If A, Be2(R') A <= B implies aj Ss bey is) = 125.2 ce 
mapping F:DC RS ee is isotone (antitone) if Fx s Fy (Fx2 Fy) whenever 
x Sy ; x, yeD, and a mapping is inverse isotone if whenever xX S$ y, then 


Fx S Fy. 
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A mapping F is non-negative if Fx 2 0 if x = 0 (also implies F is 

msotone). 
n : , 

For Ae<(R ) the spectral radius, p(A), is defined as the magnitude 

of the largest eigenvalue of A. It is always true that 9(A) s [fal]. 
ee | ————— 

A matrix Aex(R ) is an M-matrix if A exists and is nonsingular, 

and sig moor "2 ,53,.--, n. 


+ 
For Ae<(R), the matrix A is defined as 


a.. > Mee 1D es N 
and 
a.. = -|a..| ee 
By comparing the results from [32], [35], Fear [38], [39] and Vesti aie 
can be shown thet if a matrix A is irreducibly diagonally dominant, then 
7. : we , : ees Legct 
A is an M-matrix and for the iteration matrix B = (I-D A ); 
+ 
p(B) sp(B) <1. 
The linear iterative methods for Gy = v can be expressed in terms of 
matrix sums, G = D-L-U. For the Jacobi method the iteration matrix is 
-1 
B. =D (L+U). With Gauss-Seidel the lower triangular matrix L is in- 


J 
cluded with D so that the iteration matrix becomes B. = (> Saye aie For 


g 
Chaotic Relaxation the iteration matrix changes from iteration to iteration 
by inclusion of different rows of L Ce ae er The matrix is defined as 


B. = Men i o, +U). By the theorems of regular splittings [37] the 


following comparison can be made. 
Po = 0€B) = p(B, 
p(B.) < p(B) < p(B,) 


Also the convergence of the SOR iterations of the Newton-SOR method can 


be shown. 
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th _ e i 1 * 
The error of the i LEemeaeew swe = y sey Farid prodttee@e a” error 


itl ne : ; 
e =|Ble on the next iteration. It can be shown that ae? | Ss 


of 
9(B) Je’|. But because B changes from iteration to iteration the require- 
ments of the proof are more complex. These comparisons are helpful in 
comparing the convergence rates of the methods. 

Chazan and Miranker ods developed a notation for all linear iter- 
ative methods, which is used in the theorems showing convergence. First 
each new estimate is counted as an iteration, j. Then k. (3) the pa 
component of an ntl vector k(j), represents the age of the oe variable 
being used in the calculation of the new estimate of the K 4+] Gone vari- 


able. (Age is the number of past iterations.) With this notation the 


linear methods are expressed as: 


— j-k, ee See ils CaS 
al 8 \¥y IS mia | Yeo Vi4y She Nee Se Yn 
and — : 
g7k, (i) ea eae 
- ay y; i + ay, slp Rol Kae GD 
Jv” = 
ag 
ys" ae a ka D erie LO 


With this notation the Gauss-Seidel method is represented by k, (9) = 0 


Mer all i, j, and k (j) = (j-D mod ntl. 


Tab. 
The Jacobi method is represented by k, (9) = k, (4) =... 5 kD = 
a ee sD +1. 
(4 1) mod n, and K+ bP (i-1) mod ntl 
For the Chaotic Relaxation method k, (j) can take on any values, but 


to show convergence k,(j) < S for all i, ji and ka) () = {= 12 


infinitely many times. 
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oa 

Theorem II.B.1l Let AcL(R") such that A is an M-matrix. Then the se- 
quence of estimates y™ converges for II.B.10 and for a such that 
0<a < 2/(1+p(B)). [32] 

Thus for linear sets of equations, the Jacobian J, is simply the 

— : . ; ca : =I : 
coefficient matrix, and since if A is an M-matrix A exists so the true 
Newton method results in direct solution. Further since the block diagonal 
is a regular splitting the SOR iterations of the Newton-SOR method converge. 

For linear sets of equations the convergence rates in decreasing order 
are Newton, Newton-SOR, Gauss-Seidel, Chaotic, and Jacobi methods. 

For nonlinear equations the notion of diagonal dominance is conveyed 

n 
by the following two definitions. A mapping G:DCR- R" is off diagonally 
; : ne : 1 j I 

antitone if for any x¢R the functions q tte R | xite eD}>R ; p; , 6) = 
g.(x+te”) mej, | = le2,3,-.-., Neame antitone;and G is diagonally isotone 
, n : 1 i i 
menor any xeR the functions »,,:{teR [x+te eD}; m,,(t) = g. (xt te ) 
i= 1,2,...,m2 are isotone. ({e} is the orthonormal basis for R” such that 


m= (0,0,..., 0,1,0,...,0)° 


La 
The mapping G: DC Re Ss R is an M-functions if G is inverse isotone 
and off diagonally antitone. Further, the Jacobian of a set of M-functions 


is an M-matrix. With these definitions and the notation of Chazan and 


Miranker, convergence for nonlinear sets of equations can be proven. 


Theorem II1.B.2 Let G:iDCR >R bea continuous, off diagonally antitone 
and strictly diagonally isotone. Suppose for some veR™ there exists 

x, yeD such that i < ae J = { xeR"| ‘ eo ae D; oe Svs cy” (a). 
Then for ce(0,1] and any sequence la Je [e,1] the iterates ed and Ly} 


0 0 
given by the solution of II.B.10 (b) starting from x and y , respectively, 


are uniquely defined and satisfy 





31 


oo elites 5 oe 2 oe = oe 
GxI<v<cy! ete (c) 
as well as 
; & * ‘ * * 
lim x =x < y = lim y? Gx Svs Gy (d) 
Pe Pe 


Proof: The proof follows [35] and proceeds by induction. Suppose for 
some j 2 0, ize 0 


0 


0 e e e e 
omy < y cx] < v < Gy! 


me (3. eee Sok (4 “ee 
xd ,(i-D) < wd KC) 3 k.(i) < yi k.(j-1) es) 


From (a) define the functions 


Jo j-k,_, (4) i-k, 4, 


y(v) ms g.( 2 siete ie cp eal bd Vs oe sat are, 
- . cgdck, Gd) j-k, G3) j-k,,, (4) 
B(v) 8. (¥} Li Y;.) i i ie Ys 43 ig 3 


are defined for v e(x, ye]. From (e) and the off diagonal antitonicity 


of G 


Bw) = yx“) = gcc!) sv, < 8,0) = 


NOR) Ge") (£) 
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By continuity and strict isotonicity of y and p, (f) implies the 


J 


e Aa ] “a id 
existence of a unique ee and YG for which 


i mee esl 2 ae | 


J 


S a 
where the relation x: 


me e 
= v3 is a consequence of the tonicity properties 


of G. Because a,ele 1] 


i ig i geapigeecos le erga 
y; 2 y; Te eee wa = ee = x; 
“oie a gee gs 
KX, 2 x TE x3 ta, 2 XE 


mamen shows (ce) hoids for i = 1,2,...,; n. Hence x) Ss oe s y?. 
From this it follows that 


: ae a eee 
jt JS ORE wo ORE 


g, Cy = 8. (yj) ere’ 3 28 | n i 


and 


peek sat apy, aj joke) = 
g(x )<s g. (x] 1 precy Ki geeey KD OD ) iF 


This completes the induction and proof of (c). Clearly then the limits 


exist. 


Now since 
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jt+1 wo 
J = 72 = 


i+] al ah 1+ e e e ° 
Pee < xe = -—— (x? a x) +x] <- = (xi* - oo) + y? 


al PF CO. i *i € i 
a] 
imply that lim x) =x. and lim om = ye : 
po 1 i - i i 


* 
Therefore Gx = v= Gy. 


Theorem II.B.3 Suppose the conditions of the above theorem holds: 
bet ks (4) and Gy) be given such that ks (3) = ke (4) i= 12 eee 
ii 
except for t when ky (3) < ke (5) (a). Then for 1 #2, ee ae 
Malte ey 
dD IG) 


are n 
1 


ae ens OPE Spee live: 
ages oO ee) < 2 Cl 


end ifv< age then 


fee 1 2 2 
jk Ci) j-k.(j) LSD, ik Gp 
8. ¢y} 1 5 eal I + s 8. (¥} 1 Nay ye n ) 
j ; a 32 
Seemed | Gl CG <<) 


ae ok ae 
Proof: From the previous theorem and (a) & tst’s xp tJ’. The off 


diagonal antitonicity of G yields (b) and the inverse isotonicity yields 
j 


(c). The same steps hold for vy 


This shows that for the linear methods discussed, the Gauss-Seidel 
method reduces the error in the estimates at the highest rate and the 


Jacobi at the lowest, with Chaotic Relaxation in the middle. 
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The Newton methods have different requirements for convergence than 


the linear methods. 


n 0 7 
Theorem II.B.4 Let G:DCR - Be where x e€D, G(x ) = 0, G' is continuous 


2 * 
at x and Gale's ) is nonsingular. Then the Newton iterations 


= etx") 17) 6c’) (a) 


* 
converge to x. Further, if there exists a y and p such that 
oil #110 

[|G¢x) - G(x )I1 s y{|x-x |] 


then the Newton method converges superlinearly and if G(x )hh#0 
Vhe R”, h # 0, then the Newton method converges quadratically. [38] 
For the Newton-SOR method the convergence of the SOR iterations as 


well as the Newton iterations must be proved. 


Theorem II.8.5 Let G:D< R" + R" be G-differentiable on D and if G is 
inverse isotone or an M-function, then for any x€D at which G'(x) is non- 
singular, the derivative is inverse isotone or an M-function respectively. [38] 
Theorem II.B.6 Je G:DCR > ee is an M-function on D then any subfunction 

is also an M-function. [35] Let G'(x) = B(x) - C(x) = D(x) - L(x) - U(x) 

be reguiar splittings of G where Bis block diagonal and D is diagonal 


then 


[Cla E(x) + aD (x) L(x) +0) ] 


H, Cs) 


H, (x) = [(1-a)2(k) wees ce 
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From these mappings if G (x) is an M-matrix then p CH, ) = p (H.) = Lebor 


ae(0, 1). Now let m represent the number of SOR iterations. 


Theorem II.B.7 If G'(x) is an M-matrix, then p (H m+1 


m 
Oo es p CH, ) <1 and 


as m>© the rate of convergence of the Newton-SOR iteration approaches 
that of a true Newton method. [38] 
Thus when all methods converge the rates of convergence is ordered increasingly 


as true Newton,Newton-SOR m steps,Newton-SOR 1 step,Gauss-Seidel, Chaotic, & Jacobi. 


Pee C Methods of Ordering the Equations 


The last section discussed ordering the equations to reduce the cal- 
culations required to solve for the Newton update. There are other methods 
of ordering which can also increase the parallelism inherent in the algo- 
rithms used for solving the dynamic simulation problem. These ordering 
methods make use of the sparsity of the equations, to reduce the coupling 
between blocks of equations. The blocks are then used as the primary method 
of assigning equations to the processors for parallel solution. There are 
two forms into which the equations can be arranged which have been found 
pertinent for increasing the effective parallelism. The first form attempts 
to arrange the equations so that the diagonal blocks have the fewest inter- 
connections. Ideally, the equations of one block would have zero coeffi- 
cients for all of the equations outside the block. But as the number of 
blocks is increased, some equations will exist with coefficients that can- 
not all be ordered into a diagonal block. This equation is then included 
with the block which minimizes the number of nonzero coefficients outside 
the blocks. These nonzero off block diagonal terms occur randomly through- 


out the matrix. This form of equations is called the near block diagonal 
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form. See Figure I1I.C.1 for a matrix representation of the ordered equations. 

The other ordering method also attempts to arrange the equations so that 
the nonzero coefficients are in the diagonal blocks or the last block 
column. The equations which cannot be ordered into this form are included 
in the last block, where nonzero coefficients can occur randomly in any 
column. 

These equations are known as the cut~set equations, since they cut 
the remaining equations into disconnected blocks. The equations of the 
diagonal blocks still have nonzero coefficients outside the diagonal 
block, but now these nonzero coefficients are grouped into the last block 
column. See figure II.C.2 for the matrix form of these equations ,known 
as the bordered block diagonal form. 

Ordering the equations to one of these forms can result in benefits 
beyond increasing tne parallelism. The concentration of the zero coeffi- 
cients greatly reduces the programming efforts normally associated with spar- 
sity programming. Furthermore, just as optimal ordering reduced the number of 
operations required to solve for the Newton update, these orderings reduce 
the number of operations required to solve for the SOR equations to iterate 
in the Newton-SOR method. The near block diagonal form also reduces the 
number of variables which must be iterated for the Newton-SOR method. 

Carre [44], proposed a method of ordering the equations to the near 
block diagonal form. The methods used to obtain the bordered block diag- 
onal forms can also produce a near block diagonal ordering. 

There are several methods of ordering the sparse set of equations 


to the border block diagonal form. A true optimal solution exists but 
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the computation required to achieve this ordering may not be worth the 
effort required. 

There are three methods of finding a bordered block diagonal form 
of the equations used in this thesis. Each is based on the graph model 
of the equations, with a node for each variable and a branch for each 
nonzero coefficient in the equation defining that variable. By finding 
the minimum vertex cut-set of the graph, the equations which should be 
ordered to the last block can be isolated. They are the equations which 
define the variables represented by the nodes of the cut-set. Once the 
cut-set is found, the other equations can easily be ordered to the dis- 
connected blocks. 

The algorithm of Ogbuobiri, et al f45], for finding the bordered 
block diagonal form of a set of equations appears to be the most efficient 
method. This algorithm and the others are presented in Chapter Five when 
the importance of ordering is discussed. The appendix demonstrates the 


performance of these methods on a test set of equations. 


II.D Parallel Execution of the Algorithms 


The purpose of this thesis is to examine multiprocessor computing 
structures for the execution of the dynamic simulation problem. The gen- 
eral theory fcr the parallel execution is to assign blocks of equations 
to each processor. The processors would then iterate their equations con- 
currently, sharing the required data between the processors. This section 
will prove that the algorithms can be iterated concurrently, and still con- 
verge to the same solution as serial iteration. The iterations are the 


only part of the solution which requires a large amount of time. The 
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initialization and other program set up tasks will not be considered for 
parallel execution. 

A general purpose computer is assumed to be available for the con- 
pilation and initialization and of the programs. The general purpose 
computer would also perform the formation of the network matrices, and 
accomplish the plotting and other output functions. The use of the 
general purpose computer for these functions does not mean that the func- 
tions cannot be performed in parallel,only it means that the necessary modi- 
fications for parallel execution have not been studied. The amount of compu- 
tation required for these operations does not appear to be sufficient to 
warrant parallel execution. 

The most practical method of integrating differential equations is to 
convert the equations to algebraic form and iterate the differential equa- 
tions with the algebraic equations. The conversion by use of the trapezoid 
rule of integration is included in the initialization and is not part of 
the parallel algorithms. The major iteration is within one time step, but 
the computation required at convergence before starting the next time step 
will be considered and included in the parallel algorithms. 

The amount of parallelism differs with each algorithm and with the 
structure of the problem. First the general linear iterative algorithms 
parallel form is shown, followed by the parallel form of the Newton methods. 
The exact specifications of the parallel algorithms is left until Chapter Three 
where programs to execute these algorithms are developed and analyzed. 

The linear iterative algorithms require the evaluation of a function 
defining each variable. The algorithms differ only in the sequence in 
which the variables are updated. The Jacobi scheme updates the variables 


only after a new value is calculated for ali of the variables. 
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Theorem II.D.1: Let B:R > R” be the Jacobi iterative. functions and ra 
the initial estimate. Then if the new estimate is found by: 


ry 
age =" le: 


yg) ty 11.8.3 
The sequence y* found by updating the Jacobi variables by a serial 
process and 5S found by updating the Jacobi variables by a parallel 
process, are identical for ail k. 

The proof is obvious because of the Jacobi method's application of 


mre correction at the end of each iteration. 


Because the parallel Jacobi algorithm uses the same data for all 
processors, the parallel solution must be synchronized. Convergence is 
determined by all processors through the use of control points in the 
parallel segments of the code. Ccntrol points are used to insure the re- 
quired synchronization between processors, and to modify the execution 
sequence. Information from all processors is required before the indi- 
vidual processor is allowed to pass through the control point. 

For the Jacobi algorithm all of the processors must be executing the 
same parallel segment of code. ‘The code performs three distinct operations; 
computing a new estimate of the variables; applying this estimate to the 
variables; and advancing the time step. Control points are inserted in 
the code between each of these operations. The control points force the 
processors to be idle until all processors reach the point. Chapter Four 
discusses the implementation of control points, and their placement in the 


parallel algorithms. 
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The Chaotic Relaxation algorithm was designed for parallel execu- 
tion. It evaluates the same functions as the Jacobi algorithm, 
but the new values of the variables are used as soon as_ they 
are calculated. For this reason, the Chaotic Relaxation scheme 
does not produce the same sequence of estimates as the Jacobi algorithm. 
It was shown [33] that whenever the Jacobi algorithm converges, the Chaotic 
Relaxation algorithm also converges to the same poiut. When over or under 
relaxation is used, the optimal acceleration coefficient will be different 
than the coefficient used for the Jacobi scheme. Normally the coefficient 
will be closer to unity for the Chaotic Relaxation. 

The Chaotic Relaxation algorithm allows the elimination of the control 
points between the calculation of the new estimate and its application. 
All of the aigorithms require the control point before the advance of the 
time step. 

Although the Gauss-Seidel Algorithm is very similar to the Jacobi 
and the Chaotic schemes, the updating procedure for the Gauss-Seidel algo- 
rithm makes it difficult to execute in parallel. A straightforward im- 
plementation requires the ability to skew the solution process. An example 
best demonstrates this problem. Suppose a three variable set of linear 
algebraic equations was to be solved by the Gauss-Seidel process. The 
functions to be evaluated would be identical to the functions associated 


with the Jacobi or Chaotic schemes. In matrix form the iteration equations 
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Because of the time of the updating,the actual equations are evaluated 


as: 


cia n 
it] _ itl, VL Li 
a dM Fo, 85M TYG Tee 2 
k=1 k=j+1 


In order to illustrate the skewing required the evaluation process is 
displayed in Table II.D.1. Time Verdiepl aed across the table and the 
blank areas indicate idle periods for processors. The delays resulting 
from, and the control required for,the skewed execution of the Gauss-Seidel 
scheme limit the gains achievable by parallel execution. 

The Gauss-Seidel Algorithm is executable in a parallel manner much 
more efficiently when the equations are in the bordered block diagonal 
form. This form allows the updating to proceed in a sequential order and 
still use multiple processors. The use of the bordered block diagonal 
form exposes parallelism at the block level. A processor can thus be 
assigned to update the variables of a block. To illustrate the parallel 
algorithm assume the equations are linear. The matrix of coeffi- 


cients, A, can be partitioned into the following form; 


Ay 1 0 e@6;36hU6€} Ala 
Aon G8 Ady 

A = : ; 3 II.D.3 
Aol An? Aun 


Now partition y into the same blocks, and let the subscripts represent 
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the blocks rather than the variables. The Gauss-Seidel algorithm could 


then be performed in the following three steps: 


k 
(a) Zz. = An YO) : mee renee 3 Tie oe4 
nh 
k+l ie. : _.k 
(b) y, (t) = we? z, - Ww) a) Paap: S 
i 
k+1 . __k k+l =k 
eee oe ty (t) + A ty (t)} - @- 1)sy, (ct) 


i = lowe n-l II.D.6 


The necessity of storing and retrieving the portions of the cut-set update 
in steps a and b only slightly increases the number of operations that must 


be performed. 


Theorem II.D.2: Steps (a) and (c) can be executed in parallel for all 


blocks, and still maintain the sequential updating of the variables. 


Proof: For induction assume the ae iteration has been achieved. Cal- 
culation of the z's requires only data available before any processor 
begins computation. (This is equivalent to the evaluation of the Jacobi 
functions.) Therefore the z's can be calculated in parallel. Now the 
second step, (b), must be performed. Evaluation of step three, (c), re- 
quires data from the final block, found in step two, and from the diagonal 
block. Because of the bordered block diagonal form, no processor requires 
data from other than the cut-set variables. Therefore step (c) can be 
executed in parallel. Now by updating the variables within a block se- 


quentially, the order of updating becomes first the cut-set variables, in 
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Order: then the diagonal block variables, in order. The fact that all 
of the diagonal blocks are updated in parallel does not matter since none 
of the new values are required outside the block until the cut-set vari- 
ables must be updated. 

Basically these algorithms have been modified to allow parallel ex- 
ecution rather than developing new parallel algorithms. This insures the 
convergence properties remain intact, with only the Chaotic Relaxation 
algorithms convergence rate depending on the ee of processors. The 
Gauss-Seidel algorithm still convergences at the highest rate and the 
Jacobi at the lowest, for linear iterative methods. However, the execu- 
tion time must also include the delays which result from controlling the 
parallel execution and from the sharing of data. 

Parallel execution of the Newton algorithms is much more difficult. 
The increased computation required for these algorithms both helps and 
hinders parallel execution. Parts of the algorithm are very easily ex- 
ecuted in parallel, such as the evaluation of the functions and solving 
for the elements of the Jacobian. The largest part of the computation, 
the solution of the Newton update, is very difficult to perform directly 
in parallel. 

Evaluation of the functions, G, and solution of the elements of the 
Jacobian, J, require only the previous values of the variables. This is 
equivalent to the computations required for a Jacobi iteration, where all 
of the data required for the operations is available before the evaluation 


begins. Therefore the following theorem is stated without proof. 


Theorem TI.D.3: The evaluation of the functions and the evaluation of 
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the elements of the Jacobian may be executed in parallel with the same 
results as serial execution. 

The remainder of the Newton algorithm involves the solution of the 
equation for the Newton update, J6= G. The linear iterative algorithms 
could be used to find 6 in parallel, but this should be avoided because 
of the time required for the linear algorithms to converge. Another 
prospect is to invert the Jacobian. Pease [54], proposed two parallel 
algorithms for this inversion. The schemes required the sharing of all 
the elements of the Jacobian and a high level of control. The methods 
are infeasible for a large Jacobian or a high order multiprocessor. 

There are two methods which simplify this solution considerably. The 
Newton-SOR algorithm overcomes the problem by avoiding the direct solu- 
tion of the entire set of equations. The other method uses the bordered 
block diagonal form of the Jacobian to allow parallel direct solution. 

The Newton-SOR aigorithm solves the block equations explicitly then 
uses these results to iterate the shared variables to convergence, yielding 
a true Newton update. The iterations use Chaotic Relaxation, so that the 
Newton-SOR algorithm can be executed in parallel. By ordering the equations 
in the near block sgronal form the number of variables which are shared 
and thus the number of variables which must be iterated, is minimized. 

As the number of shared variables increases, the time required by the 
Newton-SOR algorithm increases cubicly. For a small number of shared 
variables, the ability to solve the blocks directly increases the region 

of convergence and reduces the execution time compared with simply a 

linear iterative method to solve the Newton update equations. The parallel 


Newton-SOR algorithm has basically the same sparsity and iteration control 
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requirements as the Chaotic Relaxation algorithm. 

When the equations are in the bordered block diagonal form, the 
Newton update equations can be solved directly by block LU Factorization. 
As with the bordered block diagonal Gauss-Seidel algorithm, there are 
three steps required for solution. Because of the bordered block diagonal 
form the equations are already nearly in factored form. The initial equa- 


tions are: 
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The first step of the solution process is performed by al! processors 
other than the cut-set processor. These processors factor the diagonal 


blocks into lower and upper triangular parts: 


i U; 


(a) I a : oe eee oe II.D.8 


and use the factored form to eliminate the last block row. This elimin- 


ation results in the formation of a modification to the equations for the 


-1 -1 
cut-set update. To the cut-set block must be added Jai Ue, Les J. 
-1 -1 
di. 


and to the cut-set functions must be added Jani Us Eg, 


k 
: G. (y ). In the 


second step the cut-set processor adds in the modifications by: 
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The cut-set processor continues the second step by solving Zz = Q for 
the cut-set updates. After the conclusion of the solution process, the 
non-cut-set processors can solve for their variable's updates in the third 
step. In (c) the cut-set variable's updates are used in the back substi- 
tution calculations which yield the remaining updates. 

Grew. Peas -1.7'c.Gy i= 1,2,....n°l 22.D.12 

ia 1 | ger GU 0 ie 

Block LU Factorization when performed in parallel results in the processors 
being idle for periods of time. During step (a) the cut-set processor is 
idle and the others are active. Then in (b) the cut-set processor is 
active and the others idle. Finally the others are active again and the 


cut-set processor is idle. The ability to solve in parallel is shown by: 
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Theorem I1.D.4: The blocks of the Jacobian can be decomposed in parallel 
and the back substitution can be performed in parallel for the Newton 


iterative algorithm in bordered block diagonal form. 


Proof: Since the Jacobian is in the bordered block diagonal form, de- 
composition of the diagonal blocks requires only the data within the 
blocks. The last row can also be eliminated in parallel, but this requires 
the summation of the elements of the cut-set block. Back substitution within 
the blocks requires only the previously found values of the cut-set vari- 
ables, to solve for the variables within each block. Therefore, parallel 
solution of these two parts is possible. 

Again the Newton methods were not altered to arrange the algorithms 
into parallel form. This preserves the convergence characteristics of 
these algorithms, and the increase in the speed of solution will depend 
only on the delays resulting from the sharing of data between processors 
and the control of the execution. The true Newton methods provides the 
highest rate cf convergence, but the method requires the bordered block 
form for parallel execution. The Newton-SOR algorithm has a much faster 
execution rate if the equations are ordered to the near block diagonal 
form, but neither the convergence properties nor parallel execution de- 


pends on achieving a block form. 
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II.E Sparsity Programming 


Sparsity programming has been discussed in several sections. It 
can be defined as a programming technique to be used to avoid computations 
whose resuit is known to be zero before the computation begins. Sparsity 
must be considered in all programs which require matrix like operations 
on a large number of variables. The solution time of the dynamic simu- 
lation problem can be reduced through use of sparsity programming. 

Sparsity programming began as methods to try to fit a large sparse 
matrix into a small core memory. Later it was realized that a large 
amount of computation could also be saved. The savings in memory and 
execution speed vary with the number of nonzero elements. Sparsity tech- 
niques usually require at least three memory locations to store a coeffi- 
cient, whereas nonsparse methods would require only one. For sparsity 
techniques to save execution time, an even higher ratio of zero elements 
is required. 

There are three basic sparsity programming techniques. [33]. The 
first is to use an index array to point to nonzero elements. Another is 
to use linked lists. The third is to use linear code. The first two 
methods are primarily to save momory, but the linear code increases the 
storage required. Linear code greatly reduces the execution time and 
uses secondary storage efficiently. Linked lists are used when the exact 
number and location of the nonzero elements is unknown prior tc execution. 

Linked list techniques are used to allow the addition and deletion 
of nonzero elements. Typically there are two arrays. One is the list of 
the beginning of each row, the other the list of the column numbers and 


the values, followed by the memory location of the next element of the 
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row. By changing the memory location of the next nonzero value, an 
addition or deletion to the list can be made. Quite often the pro- 
grammer needs not only the next element of a row but also the next 
element of the column. With the addition of an array pointing to the 
start of each column and two addresses after each value, the matrix 
could be stepped through by rows or columns. The additional address 
would point to the next element of the column. 

Linear programs are created by a special compiler, which unwinds 
the "DO LOOPS" normally used in matrix operations to a linear string of 
instructions. With the linear code, the coefficients are treated as 
single variables rather than as matrix elements. Unwinding the "DO 
LOOPS" requires a great amount of code, usually much more than is saved 
Dy the other sparsity prdgramming methods. The advaiitage is that only 
a small amount of this code is required in memory at one time. The 
reduction of core required results from moving sections of the code in 
and out of main memory as the sections are needed. Since most modern 
computers are capable of moving these sections of code without interrup- 
ting the execution stream of the computer, an overall savings in execu- 
tion time is achieved. 

The last method to be discussed is similar to the linked list, 
except that the location of the next nonzero element is always the next 
memory location. By always using the next memory location, the address 
of the next nonzero element can be omitted. This saves approximately 
one third of the linked list memory requirements. However, it requires 
knowledge of all the nonzero elements before the computation begins. 


Additions or deletions to the list require a complete relisting of the 
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nonzero elements. This is still less work than is required for the 
linear code method. 

More general programming techniques which might be included under 
Sparsity programming are the ordering methods discussed in section II.C. 
By ordering the coefficients to biock diagonal or bordered block diagonal 
form, several smaller more dense matrices may be stored rather than the 
large sparse matrix. This allows the omission of a large percentage of 
the zeros without the other expenses associated with sparsity pro- 
gramming. 

Quite often a combination of these techniques is used. For the 
programs of this thesis, the near block diagonal and the list methods 
will be combined. The diagonal blocks will remain as matrices, but the 
off diagonal eléments will appear as single variables. I€ is expected 
that with the bordered block diagonal form, the submatrices will provide 
a sufficient reduction in the zeroes so that other sparsity methods will 
not be required. 

For the multiprocessing environment, sparsity programming maps di- 
rectly into the reduction in sharing of data. It would be extremely 
wasteful to access a shared variable only to multiply it by a zero co- 
efficient. The sparsity which exists in the dynamic simulation problem, 
allows the equations to be ordered into the block forms and reduces the 
sharing of data between the processors executing these blocks to a small 


value,so that paraliel processing is an efficient solution process. 
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CHAPTER III 


ANALYSIS OF PARALLEL ALGORITHMS 


In section II.D. it was shown that the algorithms required for the 
dynamic simulation problem may be stated in a form suitable for parallel 
execution. It is still questionable whether it is practicable to execute 
these algorithms in their parallel form. This chapter analyzes the al- 
gorithms in detail to determine the requirements for and advantages of 
parallel execution of the dynamic simulation problem. 

In this chapter, the algorithms for the simulation of dynamic systems 
will be shown to generate far greater savings in execution time than costs 
from complexity. The major costs to be examined are the requirements for 
sharing data and the control required for parallel execution. The advan- 
tage is an increase in execution speed. 

Parallel execution of algorithms requires the exchange of data between 
processors. The dynamic simulation problem can be arranged to a form with 
a high degree of parallelism because only a small amount of data must be 
shared. Costly additional hardware is required for each piece of data 
which must be shared concurrently. If only one path is provided for sharing 
data, then quite often a processor must sit idle waiting its turn to use 
this path. With the dynamic simulation problem one path is all that is 
required for a high order multiprocessor. 

Traditionally, multiprocessors have taken advantage of the ability of 
multiple banks of memory to provide data faster thanthe processor could use it. 
In this way several processors can share the same memory space. Delays for 


sharing memory result only when two or more processors require a memory 
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location in the same bank. Sometimes when more than one processor 
requires the same memory location, it is because the processors are 
executing the same instructions, and the memory location contained the 
next instruction. This contention over the instructions is easily 
solved by providing multiple copies of the frequently executed segments 
of code. 

Although both processors and memories have increased in execution 
speeds, the cost of the increase in memory speed has been disproportionate. 
For most commercially available processors it is possible to buy memory 
faster than required. However, the cost of this memory is many times 
more expensive than memory which operates at the speed required by the 
processor. Since multiple copies of the programs are required, two levels 
of memory can be used to avoid the use of expensive high speed memories 
(faster than the cycle time of the processor). The local level of memory 
contains the information (program and data) required only by its associated 
processor. The second level provides for the sharing of data required by 
One of the other processors. By using shared memory for only the data 
which must be shared, a large number of processors can retrieve the needed 
data with little delay. 

This simple look at the multiprocessor structure is required for the 
development of the algorithms. Chapter IV presents the details of and 
comparison of different structures. 

The programs to be presented in this chapter are actually the parallel 
seements of the algorithms. Except where indicated each segment is executed 
by a different processor. Unlike the algorithms of Chapter II, these pro- 


grams show the entire requirements for each processor, rather than just 
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the procedure within a single time step. The programs include the control 
points showing the number of times the processors must be synchronized. 

The actual methods of achieving synchronization will be left until Chapter IV. 
Synchronization forces data to be shared within sections of the code rather 
than throughout all of the operations. The parallel segments of code es- 
tablish the rates with which data must be shared. 

In this chapter the advancement of the time step upon convergence 
of the equations is included in the analysis. 

The increase in execution speed of the parallel algorithms is based 
on the ratio of the longest parallel instruction stream compared to the 
single processor instruction stream. The delays which result from the 
parallel execution must be added to the execution time for the longest 
stream. The speed advantage will never be n-fold for n processors because 
of the delays resulting from sharing data and achieving synchronization. 
The algorithms are designed to achieve the highest possible increase in 
execution speed. The delays depend on the multiprocessor configuration 
and control structure, and as such are presented in Chapter IV. 

The input of data and the output of results is not discussed. Even 
though this is an extremely important part of the solution process, it 
requires almost no execution time. There are many direct memory access 
devices which perform these functions without interrupting the execution 
of the processor. Another possibility is to assign I/0 to the controller, 
which can retrieve the required outputs as they are exchanged between proc- 
essors. Since a General Purpose Computer is considered available to com- 
pile the multiprocessor program, it could also handle the I/0 conversion 


and plotting requirements. 
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With dynamic simulation, the sparsity of the problem being solved 
affects the rate at which data must be shared and the number of operations 
the programs must execute. When a specific problem is required to com- 
plete the analysis of data rates and execution time,one block of the 
Commonwealth Edison high voltage distribution system is used. By using 
an actual system, some assurance is gained that the results are "typical" 
of all systems. The analysis uses typical blocks from the decomposition 
algorithms of section II.C. A detailed description of the system and all 
the blocks of the decompositions are in Appendix A. No attempt is made 
to use any parallelism below the block level. 

All of the data derived in this chapter are based on the number of 
Operations each algorithm requires to advance the time step and the number 
required to perform one iteration within the time step. Since it is diffi- 
cult to compare the convergence rates of the different algorithms the over- 
all increase in execution speed possible with the different algorithms is 
based on the experimental results that the Newton iteration converges from 
five to seven times faster than the linear iterations. [58] 

The analysis in this chapter is used in Chapter IV to propose actual 
computing structures and to predict the performance of each structure. 
Analysis of the algorithms shows only two classes of multiprocessor 


structures are required. 


III.A. The Parallel Jacobi Algorithm 


The updating process of the Jacobi Algorithm requires the evaluation 
of a function for each variable. After all functions are evaluated, the 


correction (solution of the function) is added to each variable. AV 
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functions include those functions assigned to different processors. This 
is the first point in the Jacobi algorithm where the controller must exert 
its power to insure all processors are synchronized in the instruction 
stream. After evaluation of all the functions the correction to the 
variables must be applied and tested for convergence. If convergence is 
achieved by all processors for all the variables, then the controller 
causes the advancement to the next time step. If convergence is not 
achieved then the processors repeat the function éjatoteadtts based on 
the new values of the variables. Again the processors must be synchro- 
nized to insure this evaluation uses the new values of the variables. 
The three control points required for the proper execution of the Jacobi 
Algorithm divide the algorithm into three distinct solution steps; the 
evaluation of the functions, the correction of the variables, and the 
advance to the new time step. 

The other difficulty of the parallel algorithms is sharing data. 
The Jacobi algorithm shares data in all three steps. First the evaluation 
of the functions requires the value of variables from outside the block 
of variables being updated by the processor. Next when the variables 
are being corrected, the block variables which are required by other 
processors, must be stored in the shared memory. (They are also stored 
in local memory since they are required at high frequency by the local 
processor.) And finally after the variables have converged, the time 
step is advanced and the new predicted value of the variables required 
by other processors will be stored in shared memory. The sharing of data 
is restricted by the control points. The control points also require 


information from the other processors. 
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The instructions for the parallel segments of the algorithm appears 
to be just a smaller problem using the Jacobi algorithm. The only visi- 
ble difference is the control requirements. Basically for the parallel 
algorithm the number of operations required is erent the total number, 
where n is the number of processors. The reduction in sharing by grouping 
the variables into the near block diagonal form prevents the exact di- 
vision into equal parts. The fastest execution is achieved when the 
@ifferences in block size are minimal. Except for the delays resulting 
from sharing variables and the delays from synchronizing processors the 
parallel Jacobi algorithm is n times as fast as the single processor 
algorithm. 

Further analysis of the parallel algorithm requires a detailed list 
of the instructions necessary for execution of the algorithm. The in- 
structions required for one of the parallel segments is given in Appendix B. 
Only the instructions for one generator are shown in the block because the 
differences between generator models is small. More generators would re- 
guire duplication of that part of the instructions. This is not the most 
complicated model of a generator nor is it the simplest. It is the sim- 
plest possible model still representing all the different parts of a 
generating station. Since each block normally contains more than one 
generator, the number of operations required for the generator variables 
is multiplied by the number of generators. The network equations are 
regular so they can be executed by matrix operations. Sparsity is con- 
sidered more for reducing execution time than the reduction of storage, 
but some storage is saved. The actual location of the use of shared 


variables requires knowledge of the block of variables. One typical 
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block from the Commonwealth Edison system is shown in Table III.A.1. 

The actual values of the coefficients are only useful in predicting 

the rate of convergence of this model compared to another model. They 
provide no information on the convergence of this algorithm compared to 
another algorithm. The symmetry of the network equations allows the 
delineation of the block variables which are required by other proc- 
essors. Any equation which requires an external variable for its evalu- 
ation, defines a local variable which will be required by another proc- 
essor. 

From the instruction list and the table of sparsity the actual 
number of operations required for execution can be developed. Table III.A.2 
summarizes the operations in general and for this specific block. The 
variables for the general case are: 

NGEN is the number of generators of the block. 
NROW is the number of network variables. 
NNZRO is the number of nonzero coefficients. 
NVS is the number of variables required from outside the 
block (number of variables shared). 
NLS is the number of local var abIee required outside the 
block. 
All of these variables are easily found for any specific block of a 
model. More general variables are: 


ITS is the number of iterations for convergence within the 


time step. 


INS is the number of memory accesses for machine instructions. 
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Summary of Required Operations for the Jacobi Algorithm 


For the Generator Equations 


Memory Divisions Multiplications Additions 
Accesses 
function evaluation 141 0 4] 36 
variable updating 143 0 39 39 
advance time step 144 0 40 28 


For Network Equations 


function evaluation 18*NROW+ 4*NGEN+ 8*NGROW+ 7 *NROW+ 
}O*NGENT 2HAMaL AL 8NGEN+ Bo MGEN-- 
1LO*NNZRO *«NNZRO 5*NNZ RO 

variable updating 18*NROW 0 4*NROW 5*NROW 


For the Designated Block 


Generatcr 

function evaluation i923 0 141 106 
Network 

function evaluation 952 56 408 438 
Generator 

variable updating 429 0 117 117 
Network 

variable updating 396 0 88 110 
Advance time step 432 0 120 84 


Rae Tee 
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From Table III.A.2 expressions for the usage rate of shared variables 
can be developed. Without considering the control points, the access rate 


of shared memory would be 


ITs*{ 2*NVS +2*NLS ] + 2*NLGVR 





PETA. 1 
ITS* [ 296*NGEN + 36*NROW + 10*NNZRO+INS ] + 144*NGEN + INS 


This is the ratio of the shared variable accesses compared to the total 
number of memory accesses. The Jacobi algorithm requires synchronization 
S@emcairterent points of the instruction stream. This means III.A.L 

is incorrect and the ratios must be compared within the separate steps of 
the solution process. The first step is the evaluation of the functions. 
The reguirement for shared vertables comes from the function requiring 


external variables. The ratio of shared variables during this period is: 


2*NVS 
PEE AZ 





153*NGEN + 20*NROW + LO*NNZRO + INS 


The next step in the solution process is the variable updating period. 
Here the values obtained in the previous period are used to correct the 
variables. The access to shared variables comes from the updating of 
local variables required by external processors. The ratio of shared 


accesses during this period is 


2*NLS 
III.A.3 





143*NGEN + 20*NROW + INS 
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The final step of the instruction stream is the advance to the next 
time interval. This section of code is only entered once every ITS 
iterations. Delays in this section of code have much less effect on 
the overall increase in execution speed. The only shared variables in 
this section are the generator voltages that are required by external 
processors for function evaluation. The ratio of shared access during 


this final step of the iteration process is: 


2*LGVR 
a ae III.A.4 
144*“NGEN + INS 


These ratios indicate the very low rate with which data must be shared 
between processors. ‘ihe rate depends on the sequence in which the in- 
structions are executed. If the acceleration coefficient were applied 
during the function evaluation, the ratios would be significantly differ- 


ent. The ratio for the function evaluation is 


2*NVS 
Hoe TIT.A.5 
263*NGEN + 32*NROW + LO*NNZRO + INS 
and the ratio for the variable update is 
2*NLS 
III.A.6 


33*NGEN + 8*NROW + INS 


This coding method is inefficient because the rates required for sharing 
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data between processors vary to such an extreme. The multiprocessors 
would have to be able to supply the data at the higher rate even though 
this rate is required only for a short period. In the coding of the 
algorithms the rate of sharing data will always be spread as evenly as 
possible throughout the program. 

To estimate the actual rates required the values of the variables 


for the Commonwealth Edison System are as follows: 


NGEN = 3 
HROW = 22 
WNARO = 52 
Mmyise = 12 
MLS = 9 
[GVx = I 


When the instructions are executed on the CDC6400, the number of machine 


instructions for each section of the program is as follows: 


INS FCN EVAL = _ 154*NGEN + 80*NROW + 8*NNZRO 
INS VAR UPDI = 124*NGEN + 8*NROW 
INS TME ADV = 130*NGEN 


The most crucial section of the program for sharing data is the variable 
updating procedure. For the Commonwealth Edison System during this section 
approximately 1.3% of the accesses would be for data within the shared 
memory. The function evaluation requires the lowest access rate or 0.08% 
of the accesses to shared memory. However, if the code was not properly 
Ordered and the acceleration coefficient were applied during function 
evaluation, the shared memory access rate would be 3.6%, or almost three 
times the rate of the proper code. This demonstrates the importance of 


careful coding of the parallel algorithms. 
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III.B. The Chaotic Relaxation Algorithm 


The Chaotic Relaxation Algorithm was designed to be a parallel al- 
gorithm. It evaluates the same functions as the Jacobi Algorithm to 
compute an update for each variable. But, rather than waiting until all 
updates are computed before they are applied, the Chaotic Scheme corrects 
the values of the variables as soon as a new value is predicted. This 
means that the values used to compute the correction may use the estimated 
values from different iterations, In fact, it is an advantage that the 
Chaotic Algorithm uses the latest possible value without regard to which 
iteration produced that estimate. The correction procedure repeats until 
all of the variables converge. 

The Chaotic Algorithm depends on the number of processors. If the 
number of processors is equal to the number of variables to be updated, 
then the Chaotic and Jacobi Aigorithms are identical. At the other limit, 
if there is only one processor, the Chaotic Algorithm is identical to the 
Gauss-Seidel algorithm. For the number of equations in the power system 
model, it is infeasible to have a processor for each equation, and the 
execution rate is not increased with just one processor, so the equations 
are grouped into blocks with each block assigned to a processor. To 
minimize the sharing of variables between processors, the equations are 
ordered to the near block diagonal form. With the Chaotic Scheme's inde~ 
pendence from iteration numbers, as soon as a processor completes updating 
the variables of the block assigned to it, it can immediately restart the 
updating procedure. There is no requirement to synchronize the processors. 
The deletion of the synchronization required in the Jacobi Algorithm re- 


sults in the increase of the efficiency of the processors. 
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This approach is different from that of the designers of the algorithm. 
They intended the algorithm for a general purpose multiprocessor with 
totally shared memory. In the original form each processor would update 
all of the variables, as if it were the only processor. Actually, several 
processors would be updating the variables, each remaining approximately 
equidistantly separated in the list of variables. If the blocks of vari- 
ables required equal computational time, these two methods of applying 
the Chaotic Relaxation Algorithm would be equivalent. With different 
computation times for each block, the convergence rate depends on the 
least frequently updated variables. 

The instructions for the Jacobi and Chaotic schemes are identical, 

with only the order changed. However, the application of the update, as 
soon as it is computed, slightly reduces the total computation. The real 
Savings in computation time comes from the elimination of the synchroni- 
zation. The synchronization at the advance of the time step remains be- 
cause Of the requirement of convergence of all variables. Because of the 
updating during the evaluation, the access rate of shared memory is not 
restricted to the short interval of variable update as in the Jacobi. 
This results in a much lower rate of use of shared variables. The re- 
arrangement of the instructions into the Chaotic Relaxation Algorithm is 
shown in the appendix. The same block variables (Table III.A.1) are used 
when sparsity is used for the prediction of access rates. A summary of 
the operations required for the execution of the algorithm is given in 
Table III.B.1l. 

Using the variables presented in the previous section the requirements 


for the access to shared memory for the instructions within a time step 





6/7 


can be expressed by the following ratio: 


2*NVS + 2*NLS 





JEN Ba bea 
290*NGEN + 36*NROW + 10*NNZRO + INS 


After convergence the advance of the time step is identical to the Jacobi 


Algorithm and results in the following ratio of access rates: 


2*LGVR 
ITILI.A.4 
144“NGEN + INS 


The combination of the two sections of code from the Jacobi algo- 
rithm results in reducing the shared memory access rate to 0.8% for the 
Commonwealth Edison System. This is lower than the rate required for the 
time step advance, but so seldom is this code executed that the smaller 
value should be used. 

From the shared memory usage rates established in these two sections, 
and from the convergence rates shown in the previous chapter, the con- 
clusion is drawn that the Chaotic Scheme is preferable to the Jacobi 
method for multiprocessors. Since the requirements of these two algorithms 
on the multiprocessor structure are identical only the Chaotic Relaxation 
Algorithm is studied further. 

The execution speed of the Chaotic Algorithm is important for com- 
parison with the other algorithms. Comparison is much more difficult with 
the other algorithms, than with the Jacobi algorithms because of the dif- 


ferent requirements the algorithms make on the multiprocessor, the differ- 
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ences in the regions of convergence. By combining rate of convergence 
and execution speed, general comparisons are possible. However, only 
experience with a wide variety of problems will be conclusive. 

The measure of execution speed is based on the number of memory 
accesses required for one iteration of the algorithm. This value can 
easily be obtained from the information contained in Table III.B.1. 


The Chaotic Algorithm requires the following number of memory references: 


625*NGEN + 89*NROW + 18*NNZRO EL OD. 2 


For the Commonwealth Edison System block of Table III.A.1. approxi- 


mately 5800 memory references are required for one iteration. 


miL.c The Bordered Block Diagonal Gauss-Seidel Algorithm 


The Gauss Seidel Algorithm is a linear algorithm like the Chaotic 
and Jacobi Algorithms. It uses the same functions to compute the updates 
for each variable. The algorithm differs from the Jacobi in that the up- 
dates are applied as soon as computed. Unlike Chaotic Relaxation, the 
Gauss-Seidel Algorithm updates the variables in strict sequential order. 
The immediate updating of the variables is not difficult with parallel 
processing. However, the strict sequential order with which the update 
is computed requires a high degree of control to implement. The difficulty 
of sequential updating can be avoided by ordering the variables to the 
bordered block diagonal form. With this form there is parallelism even 
in the sequential updating. After the cut-set variables are updated, the 


other blocks can be updated in parallel without destroying the sequential 
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order. This results from the lack of dependence on the variables in a 
block on the variables of another block other than the cut-set. By 
simply updating the variables of a Biteek sequentially, the entire list 
of variables is updated sequentially. This divides the algorithm into 
three parts: the updating of the cut-set variables, the updating of 
the other variables, and the advance of the time step. 

An unexpected advantage of the bordered block diagonal form is the 
method by which this form shares data. The only requirement for the 
sharing of data is between the cut-set processor and another processor. 
Only two processors need share a piece of data. Of course the cut-set 
processor must share data with each other processor. Each processor 
must provide different information to the cut-set processor, but it 
provides exactly the same information to each other processor. By 
providing a separate shared memory between each processor and the cut=-set 
nrocessor, Only two processors will have to share the same memory. To 
the cut-set processor, part of this shared memory could appear as a single 
shared memory. The cut-set processor could store its variables for all 
processors with a single instruction per variable value. With only two 
processors sharing a memory, the contention over that memory can be made 
inconsequential. 

The lack of parallelism in the correction of the cut-set variables 
can be overcome by several methods. The simplest is for each processor to 
compute its portion of the update. The last block row depends only on 
the block variables, and could be solved by the block processor by sharing 
the portion of the update rather than the variables required to compute 


that portion. The cut-set processor would have to sum these partial 
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corrections with the correction for its block. This still leaves the 
Other processors idle for the final steps of the cut-set variable 
correction, and the cut~set processor idle while the other variables are 
being corrected. This implementation provides some parallelism with 
only a small degree of control. The control synchronizes the processors 
after each stage of the program to insure completion of that section of 
the program. The processors require synchronization after the separate 
portions of the correction to the cut~set variables, and after complete 
correction to the cut-set variables. Synchronization is always required 
for the advancement of the time step. 

By increasing the control, the idle time of the processors can be 
lowered. First, assume the representation of numbers by the processors 
allows a symbolic representation of a number which would not be encoun- 
tered in normal computation, say minus infinity. To start the algorithn, 
fill shared memory with this number. Then, 2llow the processors to begin 
to compute their portion of the cut-set correction. After a correction 
is computed it replaces the minus infinity stored in that location. Thus 
the cut-set processor can also begin to compute the correction. If when 
accessing shared memory for a portion of the update, a minus infinity 
is encountered,the cut~-set processor would wait for the other processor 
to compute the correction. The added testing occurs during time the 
processor would be idle. After the cut-set processor successfully reads 
a portion of the update, it refills that location with minus infinity for 
the next iteration. Likewise, after the other processors have computed 
their portions of the update, they can begin to correct their variables. 


The cut-set processor will update its variables and store them in shared 
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memory, replacing the minus infinity stored there. If another processor 
finds a cut-set variable equal to minus infinity it will know that the 
cut-set processor has not yet corrected that variable. The processor 
then waits until the new value for that variable is stored. The last 
time a processor uses a variable, it replaces that variable with minus 
infinity. The major idle periods have been removed, and the parallelism 
has been extended at slight cost. The cut-set processor is still idle 
for a large part of each iteration but it does not delay the other proc- 
essors as greatly with this form. If the other processors had generator 
and/or load models included in their variables, but the cut-set did not, 
then the other processors would seldom have to wait for cut-set variables. 
The generator model solution time would more than cover the cut-set vari- 
ables updating. 

Another possibility, if the simulation includes extensive generator 
and load modeling, is to have the cut-set processor update the cut-set 
variables with no help, while the other processors update the generator 
and load equations. (This prevents the cut-set variables from including 
these variables, but this does not restrict the cut-set choice.) The 
other processors would then share the value of their variables rather 
than the portion of the cut-set update. With the same use of minus in- 
finity, the correct value of an iteration would be insured. To make this 
scheduling feasible, the number of computations for the generator equations 
must cover the computation of the cut-set update. This requires, on the 
average, at least three generator equations for every cut-set variable. 
Again the cut-set processor would be idle for part of each iteration, but 
this is the fastest method per iteration when the generator equations cover 


the cut-set equations. 


—E———— - 
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The same basic instruction sequence is used for each implementation. 
Appendix B contains a listing of this code. For the first implementation, 
synchronization would occur after the first block of network equations, 
and again after the second block. For the other implementations these 
synchronizations do not occur. For the third implementation the cut-set 
processor executes all the instructions required to correct the cut-set 
variables. This eliminates the loop of code which computes the partial 
updates. Further the loop of code for correcting the cut-set variables 
is identical to the code required by the other processors for their net- 
work equations. The number of operations for each of these sections of 
code is given in Table III.C.1. From this table the summaries for all 
the implementations of this algorithm can be derived. 

Since the sharing of data is not a restrictive problem with the 
bordered block diagonal form, the critical information is the speed of 
execution. The Gauss-Seidel Algorithm is known to converge faster than 
the Chaotic Relaxation Algorithm. Comparison of the execution speeds of 
the two algorithms is difficult because of the differences in the require- 
ments made on the multiprocessor structure by each algorithm. A typical 
bordered block diagonal decomposition of the Commonwealth Edison High 
Voltage Distribution System is given in Table III.C.2. As with the near 
block diagonal decomposition of Table III.A.2, this decomposition is for 
a five processor structure. (For the near block diagonal form there are 
five diagonal blocks; for the bordered block diagonal there are four 
diagonal blocks plus the cut-set.) 

One method of measuring speed of execution is by counting the number 


of operations required for one iteration of the update of the variables. 
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From Table III.C.2 this is an easy task. For the first implementation 
with the code synchronized at the network equations, the number of memory 


accesses per iteration is 


625*NGEN + 30*NROW(Partials) + 18*NNZRO(Partials) 
+ 90*NROW(cut-set) + 18*NNZRO(cut-set) + 22*NPARS 


+ 83*NROW + 18*NNZRO + Synchronization delays LEPeort 


This is significantly larger than the number of accesses required for the 
Chaotic Scheme. The other implementations reduce the total required memory 
accesses, however, the delays become less predictable. The third imple- 
Mentation requires the fewest memory accesses per iteration, but it requires 
the compatation time for the generator variables to cover the computation 


time of the cut-set variables. The number of memory accesses required is 


625*NGEN + 83*NROW + 18*NNZRO 
Or 


83*NROW + 18*NNZRO + 83*NROW + 18*NNZRO 
cut-set cut-set 


ieee Cee 


whichever is greater. This third implementation, with the cut-set proc- 
essor updating the cut-set variables while the other processors update the 
generator variables, requires approximately the same number of memory ref- 
erences for the program parts as does the Chaotic Algorithm. One difference 


is that the generator and load models for the Gauss-Seidel Algorithm are 


_— ———=-~» 
— 
@ 
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divided among one fewer processor than they are for the Chaotic Algorithm. 
For the Commonwealth Edison System the different algorithms compare as 
follows. The Chaotic Algorithm requires approximately 5800 memory refer- 
ences and the first implementation of the Gauss Seidel Algorithm requires 
approximately 7000 memory accesses, but the third implementation requires 
only 5200 memory references. For the third implementation the four gen- 
erator models cover the updating of the cut-set variables. Since a com- 
pletely different multi-processor structure is required for these algorithms, 
this small speed advantage is not significant. Both algorithms are carried 


On into the next chapter for the design of multi-processing structures. 


TII.D. The Newton-SOR Algorithm 


The Newton algorithms are very different than the linear algorithms 
discussed in the previous sections. The Newton Algorithms develop a set 
of equations, the solution of which yields the new estimate of the variables 
of the dynamic simulation problem. The development of the equations and 
their solution requires many times the computation effort of the linear 
algorithms. However, this additional computation results in quadratic 
rather than linear convergence, and normally produces convergence when 
the linear methods fail. The additional computation divides the Newton 
Algorithms into four steps. 

The first step of the Newton methods is the solution of the function 
defining the variables. This function is similar to the functions used to 
predict the new estimates of the linear methods. The second step is the 
formation of the Jacobian. The Jacobian consists of the partial derivatives 


of these functions with respect to each variable, and is the coefficient 
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of the set of equations. Quite often the solution of the function is found 
at the same time the Jacobian is formed. Both of these sections can re- 
quire the value of any variable; sharing may be required. The Jacobian 
and the solution of the functions define a linear set of equations. The 
third step of the algorithm is the solution of these equations to provide 
the new estimate of the variables. Since the equations are linear, they 
ean be solved directly. It is the direct solution of these equations 
which requires the largest amount of computation, and is the most difficult 
section to arrange in parallel form. For the Newton-SOR Algorithm a so- 
lution to the Jacobian equations is found by direct solution of the diagonal 
blocks and iteration of the off-diagonal elements. Since the Jacobian ex- 
hibits the same sparsity pattern as the functions of section one, the 
sparsity from the Chaotic Algirthm carries over to this section. The 
difficulty of the Newton-SOR Algorithm is the iteration of the off-diagonal 
elements, each of which requires variables from outside the block diagonal. 
By using the near block diagonal form of the equations, the number of off- 
diagonal elements is minimized. A processor can be assigned to each block 
for parallel execution, by sharing the values of the variables of the off- 
diagonal elements. This requires a high rate of sharing data between proc- 
essors. The fourth step of the algorithm is the advancement to the next time 
interval. It is almost identical to this step in the previous algorithms. 
As with the other algorithms the variables are of two types, state 
variables and algebraic variables. The functions of the first part differ 
for the two variables types. For the state variables, the functions repre- 
sent the change in the derivative for this time step from the previous 


iteration to the present iteration. The solution has been found when the 
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derivative does not change from one iteration to the next. Calculation 

of the derivative is identical to the linear algorithm method. The code 

for this section, given in Appendix B, appears different because of the 

use of matrix representation of all of the variables, and the calculation 
of the Jacobian interspersed with the function evaluation. The functions 
for the algebraic variables are different from the linear algorithms. 

The functions are in the original form given, rather than the fixed point 
problem. The value of the function defines the error of the variables 

from a functional value of zero. For the power utilities simulations, 

the algebraic variables are further divided into current and voltage. The 
functions for the current variables define the difference between the cur- 
rent of the generator or load and the current of the node connecting to 

that device. The voltage functions define the imbalance between the nodes. 
In the previous algorithms the current equations are combined into the nodal 
equations for the loads. The voltage equations are similar to the equations 
of the previous algorithms. 

The evaluation of the functions oftenincludes calculation of the elements of 
the Jacobian. This canbe used to reduce the total computation by interspersing 
the solutions. The number of iterations required for convergence, may not 
increase significantly if the Jacobian is not recomputed at every iteration. 
Usually the Jacobian is recomputed and resolved only if the solution requires 
more than a few iterations, or if the step size or some other major variable 
is changed. (Such as the change due to the occurrence of a fault.) Because 
of the large amount of computation required to solve the Jacobian equations, 
the added iterations my actually result in a ceduction in computation time. 


After the Jacobian equations are solved once, the next solution can be ob- 
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tained in only a few operations. 

For the Newton-SOR Algorithm the solution of the Jacobian equations 
is not completely direct. Symbolically the inverse of each block of the 
Jacobian multiplies its block row and function. This process is 
totally parallel, requiring no information from another processor. If 
the Jacobian is not going to be recalculated after every iteration then 
the inverse must be saved for multiplication by the next functions. Com- 
putationally this procedure is inefficient. A matrix can be inverted and 
multiplied times another matrix in the same number of steps as it can be 
inverted. Computationally the block diagonal is decomposed into an upper 
and lower diagonal matrix. The new values for the off diagonal elements 
and the functions are found by solving the lower triangular equations then 
the upper triangular equations. The triangular form allows direct solution. 
The solution of these equations tends to fill in the off-diagenal columns. 
Columns of the Jacobian which were all zero remain such, but columns with 
One Or more nonzero elements tend to be completely full after solution. 
This is a result of the fill in of the matrix inverse. The fill in does 
not increase the number of variables which must be shared between processors 
but it does require every variable to depend on the external variables. The 
iterations to correct for the off-diagonal elements require only the external 
variables. These external variables depend in return on the iterated values 
of other external variables, including some of the local variables. But 
only a small number of the local variables are required by other processors. 
Only these variables must be iterated to convergence. The variables not 
required by other processors are simply corrected after the convergence of 


eer tne shared variables. 
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To show the fili in and near block diagonal form,the block of the Common- 
wealth Edison System of Table III.A.1 is repeated, with one of the sets of 
generator variables. In this version, Table III.D.1, the fill in due to 
the LU Decomposition is denoted by an F, the original elements by X, and 
the fill in due to inversion by I. The off-diagonal elements need all the 
values for the corrections. This shows the optimal ordering of the generator 
and network variables. 

In the actual computation it is easier to find the error in the previous 
es*imate than the actual new estimate. For this reason the last part of the 
Solution of the_Jacobian equations is the addition of the correction to the 
old estimate of the vac lanes The use of the error allows convergence 
testing before solving the Jacobian equations. 

The Newton-SOR Algorithm reqevtres the shzrirng of data in three of the 
steps. Data must be shared for the evaluation of the functions and for the 
solution of the elements of the Jacobian. The other period requiring the 
sharing of data is during the third step of the solution process. The iter- 
ation of the corrections due to the off-diagonal elements requires the values 
Of external variables. The later sharing rate is the critical rate. Control 
of the algorithm is required to insure that the evaluation of the functions 
and elements of the Jacobian use the new estimates of the variables. Con- 
vergence must be tested over all processors in two places. First, for the 
convergence of the variables, after the evaluation of the functions, and 
again for the convergence of the corrections due to the off-diagonal elements. 
All of the control points are shown in the program code. 

The sparseness of the functions is still used for their evaluation. 


Since the generator and other models of the block diagonal are themselves 
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block diagonal, it is possible to reduce the computation of the decomposition 
by sparsity techniques. The entire block diagonal matrix is stored, but 
arrays indicate the last element of each column and row to reduce the compu- 
tation. The off-diagonal values are compressed to successive columns. 

To determine the rates required for sharing data and the increase in 
execution speed, the summary of Table III.D.2 is provided. The variables 


used in that table and in the equations to follow are defined below: 


NVBLS Number of variables in the block 

LARU Length of the average row beyond the diagonal 

LACL Length of the average column below the diagonal 

NGEN Number of generator models in block 

NLOAD Number of load models in block 

NROW Number of network nodes in block 

NNZRO Number of nonzero Y matrix entries in block 

NEXV Number of nonlocal variables used in model 

NITS Number of iterations used for convergence of inner loop 
NLVR Number of the local variables required by other processors 
NSTV Number of state variables in the block 


From the summary of table III.D.2 the extremely high rate of sharing new 


values of the corrections during the iterations is shown. The ratio of 


shared memory accesses to all memory accesses is presented in III.D.1. 
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This is a favorable ratio only for a small number of external Vanbabihecs 
The ratio can be improved by sacrificing convergence speed. By using a 
Jacobi updating scheme rather than the Chaotic replacement, the external 
variables would only have to be accessed once a loop. This would lower 


the ratio to the value in III.D.2. 


(NEXV + 1) 


EEL ADe2 
(54 + 9*NEXV)*NLVR 


For the Commonwealth Edison System of Table LII.D.2, the Chaotic Replace- 
ment would require /.5% of the memory accesses to be to shared memory. 

For the Jacobi replacement only 1% would be required. This would allow 
more processors to share the same shared memory. The execution time would 
not be changed, but the convergence rate would be slower. 

For the Newton-SOR Algorithm the iterations due to the off-diagonal 
elements is almost the only time data that must be shared between processors. 
Shared data is required for the function and Jacobian evaluation and for the 
advance of the time step, but the rate during these periods is many times 
Slower than that required during the iterations. 

Also from the Table III.D.2, an estimate of the time required for one 
Newton iteration can be found. The number of iterations of the SOR loop 
is an unpredictable parameter which varies with problem and even with exact 
decomposition. The time in terms of memory eccesses is shown in 


me. D.3. 


TIME = 838*%NGEN + 107*NLOAD + 34*NROW + 40*NNZRO + 
NVBLS*{ 28 + LACL*(36 + LARU*12)] + 10*NEXV + 
15 + (NVBLS - 1)*{29 + NEXV*(20 + LARU*9) + LARU*9] + 


NITS* [NLVR*(54 + NEXV*9)] + NVBLS*(28 + NEXV*8) TileD.2 
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After the direct solution time is shown in the next section, a reasonable 
upper bound on the number of inner iterations required for convergence can 


be developed. 


III.E The Bordered Block Diagonal Newton Method Algorithm 


The bordered block diagonal Newton method follows the same four steps 
used by the Newton-SOR method. But as the bordered block diagonal Gauss- 
Seidel Algorithm differs from the Chaotic Algorithm, so the bordered block 
diagonal Newton Algorithm differs from the Newton-SOR Algorithm. The 
differences appear in the third step of the solution process, where the 
Jacobian equations are solved. By arranging the variables in the bordered 
block diagonal form, these equations can be solved directly, rather than 
iterated to convergence. The cost of the direct solution is increased 
computational requirements, idle time for processors, and more stringent 
sparsity requirements. As with the bordered block diagonal Gauss-Seidel 
Algorithm the unexpected benefit is lowered memory contention. 

Tne bordered block diagonal Newton Method requires decoupling the 
variables into blocks and a cut-set. The network variables provide all of 
the interconnections and form the entire set of variables which must be 
examined for decomposition. Since all other processors are idle while the 
cut-set processor is solving for the new cut-set variables maximum effi- 
ciency is obtained by minimizing the number of cut-set variables. Thus even 
the load models connected to nodes of the network are not included in 
the cut-set variables. These loads are included in the other diagonal 


blocks. 


The solution of the functions defining the error of the variables and 
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the elements of the Jacobian can require the values of the cut-set vari- 
ables as well as the local variables. But none of the other processors’ 
variables are required. The equations are identical to those required in 
these sections in the Newton~SOR Algorithm. The fourth section, advance- 
ment of the time step, is also identical to the Newton-SOR Algorithm. 

The third section of the program, the solution of the Jacobian equa- 
tions, is the only point where the programs differ. Since the Jacobian 
has the same eer As the functions, the Jacobian is in the bordered 
block diagonal form. To solve a set of equations the last chapter showed 
that the matrix of coefficients should be decomposed into the product of 
an upper and lower triangular matrix. The bordered block diagonal Jacobian 
is already very near this form because of the sparsity pattern. The diag- 
onal blocks can be decomposed in parallel so they are each of this form, 
leaving only the block row of the cut-set variables. When the other proc-~ 
essors eliminate these rows for the complete LU Decomposition, they modify 
the diagonal block of the cut~set variables. The cut-set processor must 
add in the modifications before this block can be decomposed. After de-~ 
composition is complete, the process of back substitution can begin. First, 
the back substitution of the cut-set variables is accomplished. Then the 
other processors can perform the back substitution for their variables in 
parallel. The completion of the back substitution yields the correction 
to each variable, which can be applied completely in parallel. After cor- 
rection the process repeats. 

The bordered block diagonal Newton Algorithm requires every processor 
to be able tc share data with the cut~-set processor, and the cut-set processor 


to provide the cut-set variables to every processor. It requires the cut-set 
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processor to be idle while the other processors are decomposing the Jacobian 

and performing the back substitution for computation of the correction. 

These other processors are idle while the cut-set processor performs the 

same computations on the cut-set. The idle periods provide all of the 

synchronization required by this algorithm. The only other control function 

required is the determination of convergence across all processors. This 

could be included in the duties of the cut-set processor with little diftueulty. 
The instructions required to execute this algorithm are given in Appen- 

dix B. The instructions for the cut-set processor are set off from the other 

instructions in their proper sequence. The bordered block diagonal form 

of the variables was given in Table III.C.2, however it is repeated here 

in Table III.E.1, so that the fill-in which results from the LU De- 

compositioi cdaii be designated. Tiié sutimdry of tlie OoOpérations is 


given in Table III.E.2. 


As with the bordered block diagonal Gauss-Seidel Algorithm the sharing 
of data is not difficuit with the bordered block diagonal Newton Algorithm. 
Since only two processors are required to share any one piece of information, 
the data rates need never be as high as for the other algorithms. This 
leaves the time required for execution as the major point of analysis. 


This time is given in terms of memory accesses in III.E.1. 


838*NGEN + 107*NLOAD + 34*NROW + 40*NNZRO + 


; + 3*NCSV) 


NVBLS* [30 + LACL*(38 + 8*LARU)] + NP*(5*NCSV 
+ 4*NCSV?/3 +44%NCSV- + 79¥NCSV - 19 +(NVBLS - 1)* 


[46 + 7*LARU + 7*NCSV] LiL 
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Both the bordered block diagonal Newton and the Newton-SOR Algorithms con- 

verge at the same rate, thus to compare their solution rates only the time 

required for one Newton iteration need be compared. Using the Commonwealth 
Edison System the ratio of memory accesses yields the execution rates of 


ITI.E.2. 


NSOR 81LO*NITS + 46600 + Jacobian eval 


BBDN 50000 + Jacobian eval 





Tees 2 


This suggests that 4 SOR iterations can be allowed before the Newton Algo- 
rithm will be faster. But the summation of the modifications to the cut- 
set block can be accomplished during the modification process by using the 
software flag as in the borderéd block diagéiial Gauss-Séitdet Algorithm. 


The ratio then reduces to the value of III.E.3. 


NSOR 810*NITS + 46600 
ree slo dyb: 


BBDN 37000 


This shows that the true Newton Algorithm should probably be used if 
the equations can be arranged into the bordered block diagonal form. The 
vast differences required in the multiprocessor structures which can execute 


these algorithms weakens the apparent superiority of the true Newton method. 


IIL.F Comparison of the Algorithms 


This chapter presented four algorithms which are suitable for parallel 
execution of the dynamic simulation problem. In the next chapter multi- 


processor structures are presented that are capable of executing these 
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algorithms. In this section the algorithms are compared to show the 
similarities that exist among them. It is shown that only two different 
types of multiprocessing structures are required. 

The most obvious similarity is that the Chaotic Relaxation and the 
Newton-SOR algorithms use the near block diagonal form of the equations, 
while the Gauss-Seidel and true Newton methods require the bordered block 
diagonal form. The form of the equations determines the type of data 
sharing that is required for the solution. 

The near block diagonal form algorithms share a small portion of the 
variables among all processors. Each processor uses one or more of the 
shared variables during the calculation of the new estimate of the local 
variables. However, since only data is shared, no processor will try to 
obtain access to shared memory on the cycle immediately after receiving 
the value of a shared variable. The Chaotic and Newton-SOR algorithms 
require access to shared data at different rates. The memory contention 
resulting from the use of shared memory is also different for these two 
algorithms. The Chaotic Relaxation algorithm requires access to the shared 
variables at a much lower rate than the Newton-SOR algorithm. The Newton- 
SOR algorithm requires one additional control point, but any processor 
which can efficientiy execute the Newton-SOR algorithms can efficiently 
execute the Chaotic Relaxation algorithm. 

The bordered block diagonal algorithms share solution information 
in an entirely different manner. With these schemes only the cut-set 
processor requires information from the diagonal processors, and the cut- 
set processor must supply information to the other processors. Both the 


Gauss-Seidel and the true Newton algorithms require the cut-set processor 
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and the other processors to alternate execution periods. Since the proc- 
essors alternate execution periods, memory SoRee Re ioe does not cause 
delays. The control requirements for the two algorithms are identical. 

Prediction of the delays due to parallel execution requires infor- 
mation about the multiprocessor structures. The convergence rates of 
the algorithms were compared in the last chapter. This chapter has pre- 
sented the number of operations that are required to compute a new esti- 
mate. As was the case when comparing the convergence rates of the algo- 
rithms, there is difficulty in comparing the number of operations required 
to compute the new estimate. The delays encountered in parallel solution 
differ with the form of the equations. Table III.F.1,developed from the 
equatiors of this chapter, shows the number of operations the algorithms 
require for equations with 5% nonzero elements within the blocks. Stagg 
[59], suggested that 5 to 7 linear iterations are required for every Newton 
iteration allowing a rough comparison of the algorithms’ solution time. 

The number of operations do not include the delays which result from 
parallel execution, so the resulting comparison cannot be final. It does 
suggest that the linear methods should converge in less time than the 
Newton algorithms. This is contrary to the experience gained with the 
single processor implementation, where the Newton method solves the model 
in less time. The use of the linear iterative methods raises doubts about 
convergence since these methods are known not to converge for some sets 
of equations. When the delays are estimated in the next chapter, the 
comparison will be refined. 

In spite of the analysis available, questions remained on the proper 


utilization of the SOR iterations with the Newton~-SOR method. Solution 
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time and sharing rate are dependent on the number of SOR iterations re- 
quired. Further the SOR iterations can be stopped before convergence 

is reached, and the Newton iterations should still converge. To test 

the effect of variation in the number of SOR eee the load flow 
problem for the Commonwealth Edison system of Appendix A was ordered to 
the near block diagonal form and solved by the Newton-SOR method. The SOR 
iterations were repeated until convergence, repeated seven times (approxi- 
mately half the number required for convergence), and only computed once. 
The highest overall convergence rate was obtained with the convergence of 
the SOR iterations. Because of the computer time required, the slower 
algorithms were only iterated ten and fifteen times respectively. The 
seven SOR iteration algorithm converged at approximately the same rate as 
the converged SOR iteration algorithm. The single iteration algorithm 
was much slower converging requiring , ten Newton iterations for every five 
of the other two algorithms when convergence was approached. The first few 
iterations were approximately the same. It is expected that more detailed 
study of the convergence properties of the Newton algorithms would find 
that the highest convergence rate would be achieved by increasing the accu- 
racy required of the SOR iterations as the Newton iterations approached 


convergence. 





CHAPTER IV 


PROPOSED MULTIPROCESSOR STRUCTURES 


Before the parallel algorithms were developed in Chapter Three, the 
rudiments of a computing structure were discussed. The computing structure 
consists of two levels of memory, multiple processors and a controller. 

One level of memory, the local level, is private to each processor. The 
other level of memory, shared memory, is connected to the processors by 

a common bus. The information required to be exchanged between processors 

is stored in this shared memory. The other processors can then read the 
information from shared memory. The difficulty is that only one processor per 
Memory cycle can access a single shared memory. The other processors must wait 
until the shared memory is free to exchange information. [In addition to 
exchanging data, the processors must pass information required to control 

the execution sequence. This information is much simpler, consisting of 

a signal from each processor designating which part of the execution se- 
quence the processor is performing and the status of the iterations. A 
controller would assimilate these signals and provide other signals to each 
processor to alter the sequence of ingevet von The most common signal 
would indicate convergence of an iteration, and would instruct the processors 
to advance the time step. 

This simple description indicates all of the major components required 
for every computing structure. First, each of the components is explained. 
Then the components are assembled into different structures. The ability 
of each structure to execute the dynamic simulation problem is predicted. 

The structures are only capable of efficiently executing one of the two 


classes of algorithms, either the algorithms using the near block diagonal 
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ordering of the equations of the simulation, or the algorithms requiring 
the bordered block diagonal ordering. 

After each structure is developed, the execution speed is predicted. 
The structures are modeled to estimate the delays which will result from 
contention over shared memory. The contention models used are derived 
from the multiprocessor model developed by Skinner and Asher [61]. The 
model yields a multipicative factor, called the stretching factor, which 
shows the amount by which the execution time is increased due to memory 
contention. This factor does not include the effects of the variation of 


the convergence rate between the parallel algorithms. 


IV.A Multiprocessor Components 


There are three parts to the computing structures considered here. 
The first part is the computing elements. The computing element consists 
of the processcr and local memory, and any other devices required for an 
independent computer. The second component of the structure provides for 
the exchange of data. For the near block diagonal algorithms it consists 
of a shared memory, a common bus, and an arbiter. For the bordered block 
diagonal algorithms, the cut-set processor handles the sharing of data. 
The final component is the controiler. It may be as simple as a few ex- 
ternal logic devices, or as complicated as another processing element. In 
fact, the controi functions may be assigned to one of the processing elements. 
In this discussion, as few specifications as possible are made. on the 
processing element. It is assumed to be a commercially available minicom- 
puter, although micros or fuli GP computers might as easily fill the position. 


A few features will enhance the execution of the dynamic simulation problem. 
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The major requirement is the efficient execution of arithmetic operations 
on floating point numbers. The sparsity programming of the dynamic simu- 
lation problem can more easily be performed for a processor with some in- 
dexing capabilities. The actual instruction set and execution speeds do 
affect the final structure, but the analysis of this thesis is based on 
a high level language implementation of the algorithms. The local memory 
associated with each processor is of any form efficient for that processor. 
The only requirement is that it is of sufficient size to contain the en- 
tire parallel portion of the program and data required by that processor. 
The ability of the structure to exchange information between processors 
determines the success or failure of the structure. If the information can- 
not be efficiently exchanged, the use of parallel processing is not effi- 
Cient. For the near block diagonal form algorithms, the simplest device 
wnich is used for exchanging data is the single shared memory. All processors 
can gain access to this memory by a common bus. Only one processor may gain 
access to the bus (and shared memory) during one shared memory cycle. When 
more than one processor requests the bus, an arbiter grants access to the 
bus to only one processor. The bus, arbiter, and the shared memory appear 
as one unit. If a processor gains access to the bus, then the requested 
information is provided at the end of the memory cycle. The other processors 
must wait until the next memory cycle and repeat the request. Because of 
the structure of the algorithms, after a processor gains access to shared 
memory, that processor waits several memory cycles before there is another 
request for shared memory. To insure against unreasonable delays, the 
number of processors should be limited so that, on the average, no processor 


will make two requests for shared memory before every processor requesting 
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shared Hesry has had at least one opportunity for access. 

The arbiter is a simple hardware device which insures only one proc- 
essor is granted access to the bus during a shared memory cycle. Because 
of the number of memory cycles between a single processor's request for 
shared memory, any scheme used by the arbiter to grant access should result 
in a "fair'’ order of selection. A true random, or round robin system 
would be preferred. But even sequential polling, giving priority to the 
closest processor on every poll, would nct significantly alter the delays. 

A true random arbiter was used for the models. 

The cut-set processor accomplishes the sharing of data for the bordered 
block diagonal algorithms. This could be achieved by a common bus connecting 
the cut-set processor to the other processors. The cut-set processor needs 
access to the local memory of these other processors only while they are idle. 
This eliminates the need for separate shared memory and an arbiter. 

If the portion of the local itienory of the diagonal processors where 
the cut-set variables are stored appears as a single memory to the cut-set 
processor, the efficiency of the bordered block diagonal algorithms will be 
increased. In this case, the cut-set processor will be able to store a 
copy of each of its variables for every processor with only one store in- 
struction per variable. 

The final part of the computing structure is the controller. The con- 
troller insures that the execution process proceeds in the prescribed manner 
by all processors. Action is required by the controller at the control 
points within the algorithms. The controller achieves synchronization of 
the processors at these points of the programs, and signals changes in the 


program, such as noving to the time step advance. The controller is not 





100 


expected to be able to provide all of the control required for major 
program changes, such as beginning execution of a new simulation. 

There are two possible devices to act as the controller. One, a 
totally external controller, uses logic hardware to provide control 
Signals based on the signals provided by each processor. The other con- 
troller, one of the processing elements, would examine shared memory for 
the status of all of the other processors and, based on this information, 
provide signals to alter the execution sequence of the other processors. 
However, the use of a processor as a controller implies the control func- 
tions will require many memory cycles to implement since the processor 
would have to perform operations to determine the control required. 

The simplest control scheme is achieved by each processor setting 
hardwdre flags with tne information representing that processor's current 
status. External logic would then use the information from these flags 
to determine the need for altering the execution sequence. If a change 
is needed the external logic either could interrupt the processors, with 
the interrupt providing the information required to alter the instruction 
sequence, or set flags which, upon reading by the processor, 
would indicate the new instructions to execute. By using external logic 
for control there is less delay than would result from the softwar 
approach of exchanging semaphores through shared memory. 

With the bordered block diagonal algorithms, the cut-set processor 
can perform the control functions without the delays which normally result 
from using a processor to control a general purpose multiprocessor, since 


the controlling functions occur when the cut-set processor is otherwise idle. 
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With all of the parts of the computing structures defined, the next 
sections show how they might be combined to execute the algorithms required 
for the solution of the dynamic simulation problem. The structures are 
divided into two groups: the structures for the near block diagonal algo- 
rithms, and the structures for the bordered block diagonal algorithms. 


First the model which predicts the actual execution times is described. 


LV.B The Multiprocessor Model 


The structures proposed must be evaluated to predict the delays that 
will result from sharing data, and the accompanying increase in execution 
time. The multiprocessor model developed by Skinner and Asher [63], is 
modified slightly to provide this prediction. 

The Skinner and Asher model use the theory of Markov Chains to predict 
the stretching effect on execution time delays due to memory contention. 

A One step transition matrix is developed. The elements of the matrix 

are the probabilities of the processors being delayed due to the other 
processor's actions. Skinner and Asher assumed each processor attempted 
to access shared memory by Bernoulli trials with probability p. When more 
than one processor attempts access, access goes to one processor with 
probability [l. The processors failing to receive access to shared memory 
repeat the request on the next memory cycle. Skinner and Asher assumed 
that a processor could repeat the request even if it was granted access 
the previous cycle. When the probabilities for each processor are the 
same, the model can be simplified to represent the major states of the 
multiprocessor. For example, a major state might represent three processors 


delayed waiting for shared memory. This is essential for models the size 


required for this thesis. 
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The model of Skinner and Asher was modified primarily to reflect 
the fact that a processor, would not request shared memory immediately 
after receiving it. The modification involved splitting the no processor 
delayed state into a state of no processors requesting shared memory and 
a state of one processor requesting shared memory. With the state 
split,the probabilities can be modified to reflect the restrictions on 
the processors'repetition of requests to shared memory, and the actual 
usage rates of shared memory can be predicted. If the probabilities 
were not modified this model would provide the same values as Skinner 
and Asher's. 

By solving the Markov Chain model (finding the steady state prob- 
abilities) the expected value of the occurrence of the different states 
can be predicted. These probabilities provide the information needed to 
predict the additional time that will be required to execute the problem 
because of memory contention. This information is expressed as the stretchiny 
factor and is computed as the inverse of the probability that a processor 
will not be delayed. The ratio of the multiprocessor total execution time 
to that of the single processor is the stretching factor divided by 


the number of processors. 


ry.C Structures for the Near Block Diagonal Aigorithms 


The Chaotic Relaxation and Newton-SOR Algorithms, are enhanced by 
ordering the equations of the simulation to a form with diagonal blocks 
and as few interconnections as possible. This near block diagonal form 
minimizes the sharing of data between processors, and improves the conver- 


gence rate of the Newton-SOR Algorithm. The near block diagonal form is 





103 


not essential for these algorithms. Likewise, the structures of this 
section are designed for near block diagonal equation solution, but they 
will execute the algorithms for nonsparse equations. The cost of non- 
sparse problems would be smaller problem size and increased delays due to 
memory contention. 

The Chaotic Relaxation Algorithm requires the ability for all processors 
to share data. Analysis in Chapter Three showed each processor will try 
to access shared data not more than two percent of the memory cycles. 
Control is required only to insure all processors advance the time step 
when all variables have converged. The Newton-SOR Algorithm has the above 
requirements, with seven and one half percent of the memory cycles going 
to shared memory during the iterations, and in addition,requires control 
to insure convergence of all processors during the secondary iteration. 

Tne simplest multiprocessor structure is the use of a common bus to 
connect a number of processing elements. Figure IV.C.1 shows one such 
structure. The common bus consists of address, data and, jcOmpro® lines. 
An arbiter would control access to the bus. The controller would consist 
of logic devices connected to each processor. This is the model assumed 
during the development of the Chaotic and Newton-SOR Algorithms. The ar- 
biter grants access based on the next processor desiring access in the 
loop. Graph IV.C.1 shows the stretching ertfect as more processors are 
added to the structure for the Chaotic algorithm. It also shows the ex- 
pected decrease in execution time. This is expressed as the percent of the 
time a single processor would require. The corresponding quantities are 
repeated for the Newton-SOR Algorithm in Graph IV.C.2. 


From the first graph it is apparent that the delays from memory 
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contention do not have a significant effect until almost all of the memory 
cycles are used. After all of the cycles are used, a further increase in 
execution speed is not attainable by increasing the number of processors. 
Saturation (100% usage) of shared memory does not occur as expected when 
the number of processors is the reciprocal of the individual processors 
shared memory access rate. This number of processors and their expected 
problem solution time are called the idealized number of processors and 
idealized multiprocessor execution time respectively. Because of the 
stretching effect, the number of processors required to saturate shared 
memory is larger than the idealized number, and even with this larger num- 
ber of processors the execution time is longer than the idealized execution 
time. For example, with Chaotic Relaxation, each processor could be expected 
to use shared memory for two pércént of its C6tal mhétiory cyclés. Fifty 
processors would be used in the idealized multiprocessor structure, achieving 
solution in 2% of the time that a single processor would require. Hcwever, 
because of the stretching effect of memory contention, fifty processors use 
shared memory for only 95% of the total possible access to shared memory, 
and solution would require 2.17% of the time that a single processor would 
require. Addition of approximately twenty five more processors would only 
reduce the time required for solution down to 2.1% of the single processor 
solution time. The idealized two percent can never be reached. 

To chose the number of processors which should be included in the 
multiprocessor is a difficult problem. It requires setting a value on the 
cost of an additional processor and a value on the time saved by using an 
additional processor. Clearly the addition of processors beyond the number 


which saturates shared memory gains no decrease in execution time. Below 
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this number, every additional processor results in a smaller reduction 

in execution time. The graph of the solution time versus the number of 
processors becomes almost level when the number of processors reaches 
between two-thirds and four-fifths of the idealized number. For general 
simulation studies it is expected that this range includes the appropriate 
number of processors to include in the multiprocessor. It is difficult to 
foresee using more than this number of processors. 

The easiest methcd to improve the performance of this structure is to 
increase the rate with which the shared memory can provide data. If the 
shared memory had an access rate twice that required by the processors, 
or if shared memory had multiple access ports then the data could be sup- 
plied to two processors every memory cycle. Graphs IV.C.3 and 4 present 
the résutts for tirts stracttre with even muttiplés of tte shared méitory 
access rate. 

Similar results are obtained with a multiprocessor having multiple 
shared memories. Figure IV.C.2 shows a two memory version. Each processor 
is connected to two common buses. Each bus has its own arbiter and shared 
memory. With two shared memories each processor would access one of the 
shared memories roughly haif as often. (Equal portions of the shared data 
is stored in each memory.) Each of the buses in the multiple shared memory 
version would be equivalent to the single bus of the first structure. 
Graphs IV.C.5 and 6 show the results for the multiple shared memories. 

(As the structures become more complicated the models of the structures must 
be simplified for solution.) 


The multiple memories can also be further improved by increasing their 


access rate. 
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Another structure uses multiple shared memories but does not commnect 
all processors to each memory. Depending on the problem the decomposition 
of the dynamic simulation problem may be structured so all of the shared 
variables are not required by all of the processors. A savings in cost 
could be achieved by minimizing the connections. Modeling such a structure 
for the number of processors typically used requires many states to be 
represented. With this structure the individual probabilities for each 
group of processors would have to be specified, leading to a prohibitively 


large model. A typical structure of this type is shown in Figure IV.C.3. 


IV.D Structures for the Bordered Block Diagonal Algorithms 


The bordered block diagonal algorithms use the parallelism of the 
equetioms v= the simulation to exeerte the Gavues-Seidel and the Newton 
method in perallel. This decomposition of the equations requires that the 
cut-set processor be capable of communicating with all of the other proc- 
essors. The other processors need not communicate with each other. This 
places the cut-set processing element in a central location, with all of 
the other processing elements connected to this processor. The controller 
must still receive data from all processors, and disseminate signals to 
alter the execution sequences of all processors. Because of the central 
location cf the cut-set processor, it is very capable of perfcrming the 
necessary control. 

The structures of the bordered block diagonal algorithms need to be 
divided further. Both the Gauss-Seidel and Newton Algorithms can be im- 


plemented with the cut-set processor's instructions overlapping the other 


processor's instructions, or with the cut-set executing only when the 
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Other processors are idle and vice-versa. With no overlap of instructions, 
the multiprocessor structure is simplified because of the removal of memory 
contention. Thus no arbiter or even shared memory is required. There is 
cost in terms of execution speed and in the usage of the cut-set processor. 
The cut-set processor could be used for convergence testing and step size 
control. 

Figure IV.D.1 shows a suggested computing structure for an implementation 
with no overlapping of instructions. This structure is simpler than the 
first multiprocessor of Section C, because there is no possibility of memory 
contention. The shared memory is actually part of each of the processing 
elements local memory. The cut-set processor can address locations of the 
Other processors memory over the bus. It is suggested that part of the 
local menories fave the same addréss, so that when the Cut~-sét processor 
stores the values of its variables, each processor receives a copy. The 
cut-set processor can also address each local memory individually for the 
values of the local variables it requires. When the cut-set processor is 
idle, the bus is off so the other processors do not interfere with each 
Other. A simple external controller, or the cut-set processor itself,is required 
to signal the cut-set processor to begin execution after the other processors 
have gone idle. The cut-set processor can signal the other processors to 
resume execution as its last instruction before going idle or intoits control mode. 
The cut-set processor could also detect convergence and deposit information 
to signal the processors to advance the time step. 

In Chapter Three a possible increase in execution rate was shown by 
having the cut-set processor and the other processors execute instructions 


concurrently. This complicates the structure of Figure IV.D.1 greatly. 
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Now there must be separate sections of shared memory, and arbiters to 
control access to these memories. Figure IV.D.2 shows the suggested 
additions to be made to the structure. One advantage is that now the cut- 
set processor is capable of providing all of the control functions. 

Even with the overlapping of instructions the delays which result from 
memory contention are inconsequential. However, there is another problem 
dependent feature which extends the execution as the number of processors 
grows. When a problem is decomposed into more parts the number of vari- 
ables in the cut-set grows. Since the execution of the cut-set equations 
demands time proportional to the number of cut-set equations, an increase 
in their number greatly affects the execution rate of the entire problem. 
Each separate problem will decompose into blocks which will call for a 
processor fer each block. An increase in the reanber of processors should 
be viewed as an increase in the size of the problem to be executed rather 
than a method to reduce the time required to execute a single problem. 
Graph IV.D.1 shows how the execution time decreases as more 
processors are added. The execution rate of the bordered block diagonal 
algorithms depends on the number of cut-set variables. As their number 
increases the largest portion of the processors are idle for longer periods. 
The increased idle time of the processors prevent the solution time from 
decreasing further. 

One feature which if added would slightly increase the 
execution rate, but would be of more benefit by removing the need for the 
many arbiters, is dual port shared memory. The local processors could be 
assigned the first of the two possible memory cycles and the cut-set 


processor the second. Synchronizing the processors out of phase would be 





116 


Arbiter 


Shared Pro- 
memory| _ cessor 










local 
nemor 


SFOBSCH 
Ay Mm 

Oo fy’ 
EE 

Oo 

> ee?) 

<< Q 
ed ee ee 















Arbiter 
B 
+ > 
— lint cs 
a menor 






AT Delt 


shared! 
memory) 


ig 






local 
memory 


Pro- 






cessor 





: menor: 
| 
: 
Pro- 
aatt = 
cessor| Vurqset 


Bordered Block Diagonal Form Multiprocessor 
CyverlLapping Instructions 
Ficure IV.Dee2 








Ss 


Ora tH NM 


HOnNnN DAO Md 


Of SO Fr cte QO 0  & 


Gauss-Seidel 


\ 





increase in processors —? 


Approximate Effect of Number of Processors 
on Execution Time 


Graph IV.D.1 


117 





ES 





uch simpler than the control required to insure single processor access 


requirements. The added cost comes from the increased memory speeds. 








CHAPTER V 


DECOMPOSITION EFFECTS 


This thesis has studied the dynamic simulation problem to determine 
the parallelism inherent in the problem and the requirements of a computing 
structure to exploit this parallelism. In the process of this study, the 
problem of decomposing the equations into blocks with the minimum number 
of interconnections has surfaced as a strategic factor for parallel ex- 
ecution. Although each system of equations only has to be decomposed the 
first time it is to be simulated, the methods of achieving this decompo- 
sition are haphazard at best. The algorithms which do exist are too in- 
efficient to apply to sets of equations of the size described in this 
thesis. In addition, the actual decomposition has different effects on 
each of the algorithms. The least affected algorithm is Chaotic Relaxa- 
tion and the most affected is the true Newton. The convergence rates of 
the Chaotic Relaxation and Newton-SOR schemes are affected by the actual 
decomposition, but only the number of operations required to compute the 
iteration is affected for the bordered block diagonal form Gauss-Seidel 
and Newton Algorithms. In fact, the same number of iterations are required 
for the parallel bordered block diagonal form algorithms as for the serial 
Gauss-Seidel and Newton Algorithms. 

In this chapter the difficulties of the decompositions are studied 


and the effects are described. 


V.A The Problems of Decomposing Equations 


Kron [44], suggested solving large sets of equations by Wweearme 


the problem into smaller parts. For the power system problem, Kron 


ee 
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proposed tearing the network equations into small parts by eliminating 
some connecting lines. The solution of the individual Parts was then 
iterated to balance the values on either side of the torn line. The 
modifications used to produce the convergence of the iterations, required 
that the equations of the smaller parts be linear. The requirement of 
linearity is considered too restrictive for for the dynamic simulation 
problem. Even with Kron's method, the choice of which lines to tear de- 
pended mainly on the intuition of the system analyst. 

The notion of cutting the sets of equations comes from the use of 
optimal ordering for LU Factorization. By properly ordering the equations, 
the number of nonzero coefficients formed by the LU Factorization can be 
minimized. This minimizes the number of operations required to solve 
the equations. Several methods have been developed to properly order the 
equations, but only by trying all possible orderings is a true optimal ordering 
achieved. Associated with the optimal ordering is the construction of 
computer programs to perform operations only on nonzero coefficients. 

The sparsity programming techniques require a large amount of computation 
to calculate the next nonzero coefficient. The data structure problems 
can be simplified by concentrating the nonzero coefficients into blocks. 
The blocks can be programmed without the use of sparsity programming. 

The use of these blocks reduces the execution time of the algorithms com- 
pared to the nonblock methods with sparsity programming. 

The methods used to find block orderings are related to the methods 
used for optimal ordering. To find the decompositions which minimize the 
nonzero coefficients outside the blocks would require essentially trying 


all possible orderings. Methods are presented to find approximation to 
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this block decompositions. 

Carre [46] proposed a method of ordering the equations to the near 
block diagonal form. This method minimizes the value as well as the number 
of off block diagonal nonzero coefficients. The minimizaticn of the value 
of these coefficients reduces the number of iterations that the Newton-SOR 
method requires. This scheme begins by forming a graph model of the equa- 
tions. The branches are then listed in the order of the absolute value 
of the coefficient of that branch. The branches are added in decreasing 
order of the value of their coefficients until a tree is formed. (Non- 
tree branches are skipped when necessary.) The disconnected parts of the 
tree correspond to the different blocks of equations. If a block turns 
out to be larger than permissible that block can be subdivided by applying 
the same procedure to it. After choosing the variables the equations within 
each block should be put in optimal order for the LU decomposition. 

The bordered block diagonal form also orders the equations so that 
the nonzero terms are in diagonal blocks. When the number of blocks is 
increased, the equations with coefficients of noncut-set variables which 
cannot be ordered into a diagonal block become a member of the cut-set. 

The cut-set equations form the last block. Two methods of finding this 
ordering are presented. Each is based on a graph model of the equations. 

To find the smallest set of equations, the cut-set, whose deletion will 
divide the remaining equations into non-interacting sets, is equivalent 
to the min-cut max-flow Graph Theoretic Problem. The graph theoretic 
procedures can be used to find the cut-set. To do this, the highest degree 


vertex of the graph model of the equations is labeled the source. The 
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next highest mode not directly connected to the source is labeled the 
terminal. The minimum vertex cut-set is then found between the source and 
the terminal by an algorithm such as the one presented by Frank and Frisch 
[49]. If the number of vertices in this cut-set is equal to the degree of 
the terminal vertice then the source vertex ee ias a member of the last 
block. If the minimum cut-set is smaller,then the cut-set is put in the 
last block. In either case the nodes of the last block are eliminated and 
the graph reduced again. If only the source vertex was added to the cut- 
set then the process is repeated until the graph is disconnected. 

For grapns representing power systems networks, finding a useful decom- 
position may be difficult. The power system has been designed to insure con- 
tinued operationin spite of failures in the system. Deletion of the cut- 
set is equivatent to those buses failing. The cut-set is therefore much 
larger for the power system network, than for a typical random graph for 
the same size and sparseness. When the algorithm presented by Frank and 
Frisch [49] is applied to a power system graph, the cut-set is typically 
the nodes adjacent to the terminal node. This produces the diagonal 
biock of only the terminal node. A block size of one is not efficient for 
parallel execution, so the algorithm must be forced to find larger blocks. 
This increases the computation and reduces the optimality of the solution. 
Further the algorithm dces not provide for finding more than two blocks 
at a time except for symmetric graphs. The minimum cut-set for dividing 
a graph into two subgraphs may not be contained in the cut-set for dividing 
the graph into three subgraphs. There is speculation in the literature 
that optimal ordering and block decomposition is an NP Complete problem 


in graph theory. [72] 
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A much more direct solution process has been proposed by Ogbuobiri, 
me al [45] for power systems. This method overcomes 
some of the problems of the graph theoretic procedures by attacking the 
problem in a different manner. The method groups the most highly connected 
nodes, allowing the cut-set to be chosen as the least connected nodes. The 
method does not help to decompose a large block into smaller ones. It 
can also be used to find the near block diagonal form by not extracting 
the cut-set variables. 

A great deal of computation can be saved in the decomposition problem 
by reducing the graph model. The reduction process combines all nodes of 
degree two Or less with their adjacent node. Further, all parallel paths 
and self loops are deleted. The resulting reduced graph can be shown to 
be either 1) a graph of five or more nodes all of degree three or greater, 
2) a complete graph of four nodes, or 3) two nodes connected by a single 
branch. The lattcr two of these reduced graphs can be put into bordered 
block form easily. The complete graph of four nodes requires three of 
the nodes to form the cut~-set. Depending on how the reduction proceeded 
there will be one, two, or three other blocks for the diagonal. A graph 
of two nodes with one branch comes either from parallel paths or one path 
of nodes of degree two. For parallel paths the cut-set is the two re- 
maining nodes and each path is another block. For one path a center node 
is the cut-set with either side the other blocks. Only the reduced graph 
of five or more nodes needs further study to find the cut-set. r 50] 

The power system networks do provide some benefits. The graph can 


be related to geographical locations which can aid in locating possible 
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cut-sets. Experience with the system will lead the analyst to knowledge 
of the weakest links in the network. These weak links are natural choices 
for the cut-set. For power systems, reliability studies may show which 
buses are the weakest links in the network. 

For the transient stability problem, there is one feature which can 
be used to always identify a cut~-set. By reducing the network equations, 
the resulting non-sparse block of equations acts as a cut-set for the 
generator and load models. These models only connect through the network. 
This method is used in [29] to reduce the execution time for a single 


processor by simplifying the sparsity programming task. 


V.B Effects of Decomposition on the Near Block Diagonal Algorithms 


The difficulty of finding the optimal ordering for both the near 
block diagonal form and the bordered block diagonal form was presented in 
the last section. In this section the effects of the ordering are pre- 
sented to allow the analyst to choose the degree of decomposition allowable. 
For the Chaotic Relaxation and Newton-SOR schemes, the decomposition aids 
in reducing the amount of sharing of data required and by improving the 
convergence rates. However, these two algorithms will achieve the correct 
solution without the decomposition. 

For the Chaotic Relaxation the exact decomposition does not affect 
the number of operations required to perform one iteration. It does vary 
the individual processor's rate of requesting shared memory. This rate is 
directly related to the number of nonzero off block diagonal coefficients. 
But even for dense equations the highest rate of requesting shared memory 
is trivially small. 


The remaining effect is the change in the convergence rate of the 
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Chaotic Relaxation with different decompositions. Proper decomposition 
can improve the convergence rate considerably. With improper decomposi- 
tion the updating of the variables could become a Jacobi scheme, where 
the values for the variables used to compute the update are an iteration 
Or more old. With block diagonal decomposition the updating sequence 
could be a Gauss-Seidel scheme. Without the block diagonal decomposition 
the equations are not updated exactly in sequential order. This reduces 
the convergence rate to a value less than that of the Gauss-Seidel method. 
The convergence rate is reduced further when an old iteration value is 
used to compute an update. This can only occur from a shared variable. 
The possibility of using an old iteration is reduced primarily by in- 
suring that each processor requires approximately the same number of 
Operations Co compute an iteration. (Equally sized blocks is the simplest 
measure of this time.) Also minimizing the number of external vari- 
ables required by a processor reduces the possibility of using an old 
iteration value. 

For the Newton-SOR Algorithm the rate of using shared memory is in- 
dependent of the decomposition (except for totally disjoint equations). 
This results from only the nonzero off block diagonal entries being used 
during the iteration for the Newton update. The actual number of opera- 
tions required for each iteration and the convergence rate of the Newton 
update does depend on the decomposition. The number of Newton iterations 
is not affected by the decomposition. 

To minimize the number of operations, the number of columns outside 
the block diagonal part of the Jacobian which contain nonzero values must 


be minimized. The actual number of nonzero entries in each of these 
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columns is not significant. Thus when the decomposition is found, the 
number of external variables required by each block is the value which 
must be reduced. 

The convergence rate of the Newton update is maximum when the proc- 
essors all require the same number of external variables. However, the 
convergence rate is not as significant as the increase in the number of 
Operations caused by the addition of more external variables. (The number 
of operations grows as fee) If given the prerogative the off block diagonal 
entries should be of the smallest absolute magnitude, to provide the highest 
possible convergence rate. 

To summarize,the decomposition for the Chaotic Relaxation Algorithm 
should strive for equal sized blocks even if additional off block diagonal 
entries are required. For the Newton-SOR Algorithm the number of external 


variables must be minimized. 


V.C. Effects of Decomposition on the Bordered Block Diagonal Algorithms 


The decomposition of the network equations provides the only high 
level parallelism possible in the true Newton and Gauss-Seidel Algorithms. 
If the equations could not be decomposed, there is no straightforward 
method of using a multiprocessor to reduce the time required for the so- 
lution of the dynamic simulation problem by these algorithms. Experience 
indicates it is reascnable to assume that the equations can be decomposed 
to the bordered block diagonal form. In this section the properties of 
the cut-set are related to the execution rate. 

Since the parallelism is found in the equations to be solved and not 


in the algorithms used to solve them, the convergence rates of the algorithms 
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are not aitered. All of the solution steps required by the algorithms are 
performed in the same manner as they would be for serial execution. This 

results in the solution being found in the same number of iterations, re- 

gardless of the number of processors used to calculate the iterations. 

With the bordered block diagonal form algorithms, the number of op- 
erations per iteration is the sum of the operations for the parallel blocks 
and the cut-set block. By increasing the number of blocks, more processors 
can be used to reduce the time required to compute the diagonal BICERE 
portion of the iteration. But the solution time of the cut-set is not 
reduced, further the size of the cut-set is typically increased by decom- 
posing the equations into more blocks. With more equations to solve,the 
cut-set processor requires a larger portion of time, so that increasing 
the number of blocks may actually increase the time required for solution. 
Since the Newton and Gauss-Seidel algorithms require such differing amounts 
of computation per iteration, the effects of increasing the size of the 
cut-set is different. 

For the Gauss-Seidel Algorithm the cut-set variables must be iterated 
while the other processors are idle. This requires the cut-set processor 
to sum the portions of the update from each processor. But not all cut-set 
variables are affected by ail processors. Thus the computation for the cut- 
set processor only grows as the sum of the number of cut-set variables 
affected by each processor. There is no difference to the cut-set processor 
between only one variable of a diagonal block affecting a cut-set variable, 
and all variables of the diagonal block affecting the cut-set variable. 


Since so little computation is required by the cut-set processor, for the 
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bordered block diagonal form of the Gauss-Seidel Algorithm, an increase 
in the size of the cut-set has only slight effect. 

For the true Newton algorithm the cut-set processor must sum the 
modifications from all of the other processors and then solve the resulting 
equations. The solution time for these equations grows as a. where n is 
the number of cut-set variables. Since the equations of the cut-set are 
dense after the modifications are summed, more operations may be required 
to solve a smaller number of equations than the other processors require 
to solve a larger number of sparse equations. Thus when the number of 
blocks is increased, increasing the size of the cut-set, the computation 
time of the cut-set is dramatically increased while the time for the diag- 
onal processors is only partially decreased. The true Newton Algorithm 
requires that the number of cut-set variables be minimized to yield the 
fastest possible solution. 

For the bordered block diagonal form Gauss-Seidel Algorithm it is 
possible to use more than one cut-set processor, by assigning different 
cut-set variables to each processor to update. The cut-set variables would 
have to be assigned so that the variables would not depend on variables 
assigned to the other cut-set processors. However, seldom do the cut-set 
variables depend on other cut-set variables. The cut-set processors would 
have to compete for the available accesses to the local processor's mem- 
ories for the cut-set processors the available accesses are sufficient ror 
several processors to share without serious contention. The use of several 
cut-set processors would relieve the bottleneck caused by the solution of 
the cut-set variables, resulting in even higher execution rates than pro- 


iected in this thesis. 
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To use multiple cut-set processors for the bordered block diagonal 
form Newton Algorithm requires a different philosophy in finding the cut- 
set. The cut-set variables must be divided into layers, and the solution 
process itself gains another layer where processors are idle. The cut-set 
is cut into nearly disconnected parts whose complete solution requires 
another cut-set processor. To find a decomposition which is itself de- 
composed into disconnected parts is not difficult. First the set of equa- 
tions are cut into two near equal subsets. This first cut is the lowest 
level of the cut-set. The large subsets are then cut into smaller blocks. 
But the cut-set from each subset is not dependent on the equations of the 
other subset. Solution of the Jacobian equations can now proceed in five 
steps. First the diagonal blocks are triangularized and the last block 
row eliminated. The subset cut-Set processor then sums the modifications 
to its variables, and forms the modification to the lower cut-set. This 
lowest cut-set is solved and the back substitution begins. The subset 
cut-set uses the lowest level cut-set variables to find the new value of 
the subset cut-set variables. Finally the diagonal blocks can use the two 
level of cut-set variables to complete the required back substitution for 
the remaining updates. 

Using the decomposed form of the cut-set variables can save execution 
time even when only one cut-set processor is used. The decomposed form 
forces sparsity into the final cut-set equations, greatly reducing the 


number of operations required to solve for the cut-set updates. 





CHAPTER VI 


CONCLUSION 


The dynamic simulation problem can be structured so that a high 
degree of macroparallelism exists in the solution techniques. The macro- 
parallelism results not only from the algorithms used to solve the dyna- 
mic simulation problem but also from the actual equations of the models. 
Simple multiprocessors have been proposed which can use this macroparal- 
lelism to greatly reduce the time required to solve the dynamic simulation 
problem. 

The model of a dynamic system has been considered to consist of a 
large number of nonlinear differential and algebraic equations. In order 
to efficiently solve these equations the differential equations are ex- 
pressed as algebraic equations through the use Of implicit multistep inte- 
gration methods. The entire set of equations is now solved by the numer- 
ical methods applicable to nonlinear algebraic equations. Four algorithms 
have been studied in detail, Chaotic Relaxation, Gauss-Seidel, Newton-SOR, 
and true Newton. Macroparallelism is present in the Chaotic and Newton-SOR 
algorithms. The Gauss-Seidel and Newton methods depend on the parallelism 
for the actual equations for parallel solution. 

The convergence properties of the parallel solution methods are basi- 
cally the same as the serial convergence properties. Where convergence 
can be proven, the rates of convergence of the meenods have the same rela- 
tive ordering as they have in a serial implementation. The true Newton 
and Newton-SOR methods converge most rapidly, followed by then the Gauss- 


Seidel and Chaotic Relaxation Algorithms. The Newton-SOR method requires 


iS 0 





rt 


two levels of iteration. The major iteration converge at the same rate 
as the Newton method, while the minor iterations converge at a linear 
rate. The Gauss-Seidel method can be shown to converge at a higher rate 
than the Chaotic Relaxation, but the rate is only slightly higher. 

A feature of almost all large sets of equations, especially those of 
the dynamic simulation problem, is sparsity. Because of the sparsity 
present in the equations of the models, the equations can be ordered to 
either the near block diagonal form or the bordered block diagonal form. 
These forms concentrate the nonzero coefficients of the equations into 
diagonal blocks. The parallel solution techniques then solve the blocks 
of equations in parailel. Information must be exchanged between the 
parallel solution streams for those nonzero coefficients which cannot be 
arranged into a diagonal biock. 

The near block diagonal form of the equations reduces the required 
sharing of solution data between the processors for the Chaotic Relaxation 
and Newton-SOR algorithms. This form also decreases the number of opera- 
tions and increases the convergence rate of the Newton-SOR method. But 
arranging the equations to this form is not an absolute requirement for 
parallel execution. 

By arranging the equations into the bordered block diagonal form, 
macroparallelism can be obtained in the Gauss-Seidei and true Newton 
methods. For these methods the degree of parallelism depends entirely 
on the actual decomposition of the equations. Parallel solution methods 
for these two algorithms only slightly increases the number of operations 
that must be performed for an iteration, even though many processors are 


performing the operations concurrently. The convergence rates of the 
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algorithms is not affected by parallel solution. 

The algorithms can also be ranked by the number of operations each 
requires to compute a new estimate of the variables. In general for the 
same problem the Newtcn-SOR algorithm requires the largest number of op- 
erations per iteration. The Newton method is next followed by the Gauss- 
Seidel and then the Chaotic Relaxation. There are several orders of mag- 
nitude difference in the number of operations required for the Newton 
methods, compared to the linear methods. It has been reported that the 
linear iterative methods require on the average 5 to 7 iterations to re- 
duce the error an amount equal to one iteration of the Newton methods. 

To find an overall ranking,the delays due to parallel execution must be 
included with the convergence properties. 

The two methods of ordering the equations basically determine the 
requirements for sharing data and thus the structure of the multiprocessor. 
The algorithms determine the rates at which the data must be shared and the 
control required for execution. The sharing of data causes delays in the 
execution either because the required information has not yet been computed 
or the device containing the information is busy servicing another processor. 
By minimizing the amount of information which must be shared, the delays due 
to sharing this data are reduced. The delays due to control of the solution 
process are inherent in the algorithms. Some of the control delays can be 
reduced, but the gains achievable are not as significant as possible for 
the sharing of data. 

The algorithms of the near block diagonal form, exchange solution data 
conveniently through shared memory.fhese algorithms, the Chaotic and Newton- 


SOR, have the same control and shared data requirements for parallel execution, 
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and thus can be executed on the same multiprocessing structures. However 
since the Chaotic Relaxation algorithm requires shared data at a much lower 
rate than the Newton-SOR algorithm, approximately three times the number of 
processors can be executing this algorithm in the same structure with approx- 
imately the same delays from memory contention. The Newton-SOR algorithm 
requires several times as many calculations per iteration as the Chaotic 
method, and will converge at a higher rate. Since the number of operations 
and the convergence rate is heavily dependent on the actual problem, an 
actual ranking in terms of solution time requires problem solution experi- 
ence. 

The bordered block diagonal form algorithms exchange data in an entirely 
different manner so that contention from shared memory is not a problen. 
From the memories’ point cf view the Gauss-Seidel and true Newton algorithm 
appear to be executed by a single processor. This is because the cut-set 
pceocessor accomplishes the exchange of data while the other processors are 
idle. Thus the delays in parallel execution result from processors being 
idle waiting for data to be computed, and the delays become longer as the 
number of variables in the cut-set are increased. As a result the advan- 
tages derived from parallel execution depend on the particular problem 
being solved. Experience has shown that usually,sparse sets of equations 
can be arranged into the bordered block diagonal form. By combining the 
information presented, estimates of the actual decreases in execution time 
can be predicted. By comparing the number of operations required for 
serial execution to the number of operations in the longest parailel 


stream plus any concomitant delays,the decrease in execution time can be 
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predicted. If the estimates of the convergence rate are included, a rough 
ranking of the parallel algorithms can be obtained. 

For the Chaotic Relaxation algorithm a single processor implementation 
would execute the same instruction sequence but for more variables. However 
the convergence rate of the single processor (Gauss-Seidel) would be higher 
than that of a multiprocessor. Arranging the equations inthe near block diag- 
onal form,to reduce the sharing of data between processors, prevents the equations 
from being divided equally between the parallel processors. Therefore, 
if p is the number of processors, 6, the fraction of additional operations 
in the longest parallel path, and s is the stretching factor due to memory 
contention; then a multiprocessor couid execute an iteration of the Chaotic 
Relaxation algorithm in (1+6)s/p of the time a single processor would re- 
quire. Sample coding indicates that on simple single shared memory multi- 
processor,the Chaotic Relaxation algorithm could be solved by 40 processors 
in parallel in 2.7 percent of the time that a single processor would re- 
quire. By increasing the complexity of the multiprocessor, even further 
reductions in execution time are possible. 

With the Newton-SOR algorithm comparison is more difficult since 
there is usually no reason to use the Newton-SOR algorithm on a serial 
processor, The true Newton solution could be found in fewer operations. 
When the equations are too dense to be able to achieve the bordered block 
diagonal form or the linear iterative method does not converge, then the 
Newton-SOR method must be used. For completely dense sets of equations 
it does represent a possible savings in execution time over the true Newton 


method. Because of the higher rate required for sharing data, fewer 
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processors can share the same memory efficiently. A multiprocessor with 

10 processors could execute the Newton-SOR algorithm in *& the time a 
serial processors could solve the true Newton algorithm. (This figure 

is based on the sparsity and decomposition of the power system problem used 
in Chapter Three and discussed in Appendix A.) 

The bordered block diagonal form algorithms are even more problem 
dependent than the Newton-SOR Algorithms. The parallelism of the Gauss- 
Seidel and true Newton algorithms depends on the achievable decomposition 
of the equations. The delays of parallel execution are based on the time 
required to solve the cut-set equations. The number of processors capable 
of executing the algorithms in parallel depends on the number of diagonal 
blocks into which the equations decompose. If the number of operations 
required to solve the cut-set equations is equal to the number of operations 
required for the largest diagonal block, then the solution time for the 
multiprocessor is 2/pof the single processor time. For the problem used 
for the Newton-SOR algorithm the decomposition indicates five processors 
could efficiently execute the problem. Other decompositions exist that 
could use more processors, and if larger problems were used for a basis 
then it is expected that many more processors could be used. for this 
problem the Gauss-Seidel algorithm could be solved in 1/3 the time of a 
serial processor and the true Newton algorithm cculd be solved in DS 0f 
the time. 

Methods have been discussed to improve the decompositions and increase 
the complexity of the multiprocessors to reduce the time required fer par- 


allel solution even further. With these improvements the bordered block 
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Giagonal form multiprocessors would probably provide the fastest solution 
of the dynamic simulation problem. However, since the Chaotic Relaxation 
Algorithm is capable of being executed on higher order multiprocessors, 
the increased parallelism may provide the fastest solution. Only through 
actually solving many problems by all algorithms can a preferred method 
be established. The results do indicate that parallel solution of the 
dynamic simulation problem is feasible and the multiprocessors structures 
required to execute the parallel algorithms need not be highly complex. 

This thesis has shown many directions for further research, from 
theoretic numerical methods to hardware controller implementation. The 
mest obvious missing data is experimental evidence that these multiproc- 
essors can easily be built. Unfortunately, the time requirements pro- 
hibited this approach, even though it is exvected that low level parallel 
dynamic simulation could be performed on the Computer Science Research 
Network of Northwestern University. Another possibility is to model the 
proposed multiprocessor structures with a microprocessor network which 
would test the control requirements and delays from sharing data. Still 
for complete assurance of success, the multiprocessors must be built and 
the algorithms tested for many problems. 

An unsolved problem which seems to be reappearing in the literature 
is the decomposition and optimal ordering procedures [72]. Efficient 
solution procedures would greatly benefit this class of problems. 

There is still a wide gap between sets of equations which can be 
proven to converge under an algorithm, and actual convergence. The re- 
strictions necessary to prove convergence limit the ability to compare the 


algorithms execution rates. 





APPENDIX A 


A POWER SYSTEM EXAMPLE OF A DYNAMIC SIMULATION PROBLEM 


Whenever a specific problem has been required to complete the analysis 
of this thesis the Commonwealth Edison High Voltage Distribution System has 
been used. The distribution system consists of twelve generating stations, 
ninety-five busses, and 143 lines, seven of which are parallel to other 
lines. The loads are all modelled by constant loads. 

The equations of the model are equivalent to those of [29], and are 
developed in [58]. The equations can be found in the programs of Appendix B. 
A list of the busses and lines connecting each bus is given in Table A.1 and 
ae. After the network has been modelled by a graph, it is reduced by the 
procedures of Section V.A. The reduced graph is a much more manageable 
size with forty nodes and sixty-nine lines. The reduced graph model is 
presented in Figure A.3. The reduced model can be decomposed, enabiing 
the entire network to be decomposed. A typical near block diagonal form 
decomposition is given in Figure A.4, and a bordered block diagonal form 
decomposition in Figure A.5. Neither of these decompositions are unique 


nor are the decompositions optimal. 
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me SUMROCUTINE CHASCTIC ( CeVeYeP ) fae ee ee Bea 
Cc 
CC F4eree THIS USES CHAD TL CeREL. A XOeTORON re SCLVE THE TRANSIENT f$oOGaS 
. CC thenwne STABLITY PROALEM - 8g Ce bees 
COMPLEX CoVeVYeSUMsCOXs ONE 9? 


fee - OOITMENSION Y(100- 20) +V (100) eP(100),ER(1N) 9C (300) sNROW(100) 6 1V(100¢100) 


1 20) ¢ [SVteeo ) 


C 
eg, 2 wrth ett) <> DATA SHCULO BRE INSERTED HERE are: Dit Oo 
C Stor eir ALL INFORMATION IS ASSUMED TO BE PRESENT AND FILLED IN 
c SEORAR YY KY THE Dates CAROS COU 
C a Se i | | oe ten eh, 5a) ae _ 
G 
C feted Y> 1S SORBED IN SPARSE FORMe JHE FIPST ELEMENT CF SoGeeGee 
eee =| ete Ot EAGH ROWE IS THE BRVERSE CF THE DISGSNAL ELEMENT 8 eeESe 
C tees AE CC TMERENONEROMREIEIIIGMARE COMPRESSED 16 THE N@&XT & cine oo O 
c 
- Cc Ho cage STARTECE PARALLEL TE RAT ICSES i eS tah tb Oho 
c ao SUG oR nee ta MANE SRT CR TO RHIERING LOSP ae & & 
Cc HOGG ed TIPE WILL BE ADVANCED BY CCOE AFTER CONVERGENCE HOVE LEDS 
--~ ¢ OU OeD CODE WILL REINTIALIZE FOR NEW TIME STEP SOOO EE | 
C 
IBS = JuCL=}) ¢ 
= _ TRL = TRtv) = eer > i ee ie » ee i fa, Aa ee eae 
, IPNT = 49 
2 KNT = O 
Se Sn 1Genv = O ee _ ees 
Oo) = (CCL) POSTINI THOIYr }= ATHMGIC(T) pecans (THO)) 
COOL = eS Cos TiN) + Se Ae 
= @ = 2 ; 
C GOES HHO GENERATOR STATE VARIABLES SOGEES 
Pc 
OTHOAL = WO) = ae Pe 
SOR svt (THOJP « HHeDTHOAl ) + AY THO?D 
Mo { COR GT. GEAR) SROONV ad 
fear) = [HCL «+ COR ; fo 
OWO]l = RNC) (PMG) - (EQ27O1¢CO0] ¢+CDEO) #CN01 ¢XQXN01 40001 "CHN1) 
} = OOF01 ) | 
_ COR = we (vinyP ¢ HH*MWOl) +AWwetwO] : Hoe 
IF ( Cen ebTe Fietre ) MC GEV = } 
WO} = WO] + COR 
PEQORY = T1IbaAle (VE QT ~E9GL Re XOXOO1PCNNT ~ SO!’ ) = 
COR = We(EOOTP « HHENTAGQ) 3» + AwFtEaQol 
ieee COR CGT. Fem ) h6UICONV = 1 
Fis] = AD} «+ GER 
OFOd) = TIRGP®(XQxO01] 809091] = COO) ) 
COP = Ve Chnole ¢ HHeEDPENOT) « AWeENQ) 
ieee et, )6CUL FREE) 6 COHNV = 1 
Bee, = EUG) ¢« COR 
E HOH eteet CIENERATOR STATE VaARTABLES Seen s 


C ! 


> P- 


je 





Ser UTINE CHASTIC NORMAL 


a = ~~ —-— «= 
s 


ea oo me - eee oe _ 


ee — «= == o - —-——m— wae + 


100 - 


G pare oS EXCITER/SREGULATCR STATE VARIARLES 
oo OVANL= TAIOL4#(VO014VSOlL“VO)=TEINO] & (VFO1-VNEQ1) -=-VAO] 
He WeVANOT 2b Te O20 eAND.OVAG] oLT. 0.0 ) DVAOY = 0.0 
TF (VAL «GT. VAMXOL «AND. DVAOL «GT. O69 ) DVAQ] 
Meme SCC OR FE UEC VAOLP + HHOEDVAG)) «© AweVAg) 
[Ee COee.Giae ERR ja comwy = } 
VAN} = VAG) *¢ COR 
- - QVOEO] = TEIOLOE(VEFOlH-VNED) ) - - “tote 
COR = VOCVOEOITRP + HHeDVOED}) «© AWevDEQ) 
Tee CCR Sore. ERR } ICCNV = } 
VOEO) = VOEO) + COR Wee ewe ao 
THPO = AHS (VFO1L) = ESQ} 
LE Te Orel 06 0 2): PO = SOO 
SVFOL = TMPOrRTMPOPCSTOL*SGI(VFO)) 
DVFO) = TEIOLSA(VAOL HTKECIYOVEFOL-SVFQ1) 
COR = we(VEOIP + HHENVEFO} ) + AW<VFO! 
air { CGerGi. MeRnR YM CONV = 1. Eo Ne aaeee 
VFOQL = VFOLl ¢ COR . 
C ; 
c Ss EXCITCR/REGULATOR STATE VARIARLES -. -- 
G = 
Y20] = GEG) #01°TI T4602 
2 -- YlG@ = Pl enre1yeoweVS2e0C)) yesh =aeies - 
mOOTw= VIITSOr*y¥10) +¢ SMTTOLeVSIO} 
C 
rc OM MOUES -- SUPPLEMENTARY STATE VARTABLES 
C 
OVS20] = YQ) 
COR = welvSeulP « HH*DVSZ0}) + AWeyS7Zo} - 
Dee Cont «GT. ENR )Y ICCNV = } , 
VS29) = VS270) ¢ COR 
UVS101 = T47101¢¢(¥1LO1-VS10)) 
COR = WHIVS}OLP + HHeENVSIOL ) « AtiaVS101 
bree CeReeGT. FRR ) ICONV = 1 
oe SINE] someVS TO} «© COR aan,” oars 
DVS0QQ] = YQ) = VSCO}#TI690} 
COR = W*(IVSOQOJP « HHEOVYSCO) ) +¢ Alv¥#VSO01 
Meee e«Gle ERR ) ICG ==] Ds 
VSCO) = VSOAO1 ¢ COR 
; $b ib ab 4 dh Ge de SUPPLEMENTARY STATE VARTABLES 
: VSO) = ThG?#YOO) *© CMTTENL®VSCO)] 
1F ¢« VS)) GT, VSMO1) VSO = VSMO} 
IF ( VSa) eLTe~VS'0}) VSO1 =-VS‘i0) 
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V00}1 


= ENCL=KAQ1LFeCDO} « XQQ0}"CQOGI 
VQ0Q1l = EQG) = RAQJ9CGOL = ADOL4eCDN] 
VC) )) = 


1 SIN(THO)) ) 
. VOI = SQRT(VDO14VD01 +« VOO1eVQ0) ) 


MeroyT = PTO} -7T1TKO1*WwO) 
A201 = P?C) + POO) 


s 
ae eae ener 
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Ree Ae eetels 10.0 ) 201 = 060 
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— wee eee le 


OOOO TURBINE / GOVERNOR STATE VARIABLES eooa ae 


AAD 


OPIC} =—TUIO)&(CKI9O)+X101] ) ; ~- : os: -- sie - 
COP = wWe(P1IOP «© HHYDP1O] ) «© AWeP1Q0) 
“— -- - P1lO} = PIC) +* COR - “ose = ; - Gi - -- -% a 
DPec) = T3I1014(X101-h201) 
COR = We(PFQO)P «¢ HHENP?O}] ) + AWeP2ZgO)} 
eee IF MPCOR .GT. ERR ) ICONV = } 5 
PEO), = P?ZO) «© COR 
NP4&O) = IRIALGE(OMK?ZOL ©x20] -P4Q0)) 
COR = wel(PGOQIP + HEHEOEPGO) ) + AWePp4go) 
MeO CCRALGT. ERR 2 ICONV = 1 
P40} = P4O}Y «+ COR 


-- 7 - a . - _- . —- 


ee -- C — Maen RES oe eats eee ‘ 
C Seateatht to TURBINE 7 GOVERNOR STATE VARIABLES POGHKOE 
c , 
oe PHO) = P4C1) *¢CKx2e01] 4X26) ee oe ee 
c 
c Otro oY ‘NETWORK EQUATISCNS COs SOG 
a —— ? << = SP ee = 
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C Feo ueve NROW STORES THE NUMBER SF MONZERS ELEMENTS OF EACH ROW 2#aeeuen 
— G a : bs 2 : Z sa 5 
Cc GORE RUSY IV THE? POPNTS TC THE APPROIATE NOCDE VCE TAGE sh Ceeee er ome eet 
C Gertec's NOM ZERO ENTRY IN Y GOatOS 
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VEO) 
VAC) 
TMPA 
VOCE) 
VOrFOI 
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Ci 2 «Phe VEO} --- ---.- 
THPO «+ HEHeOVEO) 
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¥35101 
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VScuy] 
TMPO 
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Ce PHHPIVAQ) 
¢ HH*OVAO) 
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om 
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Ane DVS 201 
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26 SHHYNVSOO] 
HH*DVSO0] 
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2e*HH*UP1Q) 
HHIOPITO) 


TMEQ « 
Pos 
PoC) 
THPO 
PAC} 
PG] 
Tee 
P4Q) 
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Ce *HHYDP2Q) 
4 HHtOrP?Q) 
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ROMAMCE TIME STEP 
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om be WPNT ) GOTS, 2 
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2 eee — ee me eee 
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SUBRQUTINE GUASS (PsV.CrYsIB) 


eo¢g0d t 


THIS USES THE SORDECED BLOCK DIAGONAL FORM TO ALLOW 
S$oeowvwv 


GUASS SEIDEL ITERATTONS TO RE PERFORMED IN PARALLEL 


GOBKRHGE 
SOoOndaen 


COMPLEX CyVeYoSUMSCOXVONE aT 4p 


DIMENSTON YC(1U0, 29) .V(100) 471100) ,7H (10) sC (100) yNROW(100),7V(109,100) 
1 20) 9TV (29,20) sNDR(100) sIVP (20420) 


HEHHdd & 
POHUHGH 
HHEHveELHASG 


OATA SHOULD BE INSERTED HERE AOHGHOO 
ALL INFORMATION IS ASSUMED TO BE PRESENT AND FILLEN IN 
RY THE DATA CAROS HHOGHED 


eeeoe Y US STORFN IN SPARSE FORM, THE FIRST ELEMENT OF Han 000 
Soares FACH HOW ITS THE INVFOSE OF THE DIAGONAL ELEMENT HaddeD 
= adil A LL OTHER NOANZERG ELEMENTS ARE COMPRESSED TO THE NEXT PEGHGae 
HEEUWOS START OF PARALLEL ITERATIONS HHOGHOE 
BOO Vi ALL PREPAGATIONS M*0E PRIOR TO ENTERING LOOP HHHDOS 
Had O 6 & TIME WILL RE ADVANCE™ RY CODE AFTER CONVERGENCE HEHNDID 
cae wee CODE WILL REINTIALTIZE POR NEW Tie Sie eas a! I 
row = THR) ; 
Tiel = JAtNel) * } | ee 
IBS = IR (tel) 1 
I8Bl = IB(7) . 
IPNT = 0 yn. ee ater: 3 . 
KNT = 06 
TCONV = 4 
COnl = OCFALI(C(T)) ®SIN( THO) - AIMG(C(I)) ®COS(THOL) — 
COOL = AFAL(C(T)) *COS(THOL) + AIMG(C( LY) SSIN(ITHOL) 
eHAeceoss GENERATOR STATE VYARTARPLES oie 
OTHO] = Wl 
COR =WO(TKQIP © HHEOTHO]L ) + AWSTHO] 
freee GOR {GI.. ERR ) [CONV = } 
THO! = THOT + COR 
OWO] = RMA1L* (P40) ~(ZAP0}#CQ01+EDPOL#CDO1 *AQX001 FCN019C00)) 
lee OCLPUOT ) 
COR = We(unhlP * HHeOWOl) *AWeudl 
IF € CCR (GT. ERR } I1C0RY = 1 
WO] = W001 * COR 
DEQQ] = TINVI*(VFOL“FQ01L-XOXNC1LHCNI1 ~ SO) ) 
COR = W9(FQ0IP ¢ HHENEO] ) + AWFEGO] : 
IF ( COR ,GT. ERR ) ICONV = 1 
EQQ01 = FN! + COR : 
DEOGL = T19918(KAXENI8CO0] - FOOL ) 
COR = W*e(FNO1P «+ HhENEDOL) © AWYEDOI 
IF ( COR ,GT, ERR } ICONV = 1 = 
E00L = ENOL « COR 
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Wares oat h% : 
ae as a ae ne wm, AA OD =3.5 ho oon S56 60000 BOCAS e CS OOD Nene. pesiee 
GUASS HORMAL COC 6600 FIN V3.0-P336 OPT=2 07/21/76 
il GENERATOR STATE VARIABLES gas 
V001 ENni-RADL9CDO1 + XQ01#7000} 


AAD 


oie ke) 


On” 


Aaa 





VGOl = ENN] ~ RADLeCHO]L ~ xp01e¢001 


V(T) CMOLA(VDOL®SIN(THO1) eVAOLECNS (THOL) »=-VO018COS(THOL) #YQ0i» 
1 eSIN (THOM ) 


VOl = SQRT(VNOleVD01 + Vadl¥Vvool ) 


a dia FACTTER/REGULATOR STATE VARTABLES me 


DVAQ1= TATOL#(VOOLFVSOL“VOLSTFIKOI®(VFOI“VDEOL) <VAOl ) 
fee vnte tT, UL 0 CAND. DVAQ] JLT. 0.0 ) OVAOl = 0,0 

Pr OVA0) SGT. WAUXIO] SANID. OVW .GT. 0.0 ) OVA0! = 0,0 
COR = WY (VAUIP « HH#®nVA0]) + AWeVAG) 

We { COR’ .GT. ERR ) POEONV =") 

VAOl = Vadl + COR 

DVOEO] = TFICLF(VEFO1-VDEO] ) 

COR = We (vNEGIP +» HHeDVOEODL) + AW*VDEQ) 

Meee. CORD. GI, ERR ) ICONV = 3 

VOEOi = vnFdI * COR 

TMPQ = ARS(VFOL) = ESO01 

eC INPn .T. OO meTIPO = 0.0 

SYFOL = TMPOSTSPOSCSTOQLESGEN(VFQL) 

DVF QO) = TFEICIMIVADL<TKEQL OVE O1~SVFQ)) 

COR = W®(VFOIP * HH*NVFO] ) + AYOVEO) 

ie (CORD Gly -FRR.) ICONV =.) 

VF Ol = YFr)] « COR 


eeavane EXCITOR/REGULATOR STATE VARIABLES | HHoooG 
Y2O01 = GkKALHWOLSTITS6N) 
mI) = T1991*t%enlvs201) 
WOO] = T37TSORY1Ob ,wONTTOeVS 101 

POHORad SUPPLEMENTARY STATE VARIABLES Sooneo 


OVS201 = Y101 

COH = We(VSZ01P *¢ HHeDvS201) +* AVW#VS201 
Wm ( COR oGT.» ERR ) ICONV = } 
vS20} = vs201 * COR 

Sms) = Wolo (Y101-VvS101) 
COR = WHlvSlHIP + HHeOVSlol ) 
Sime COR (GTO ERR ) TCONY =) 
W516) = VS10} * COR 

OV50nN1 = ¥901 - VS001*%TI601 
COR = wWe(vSdGIP + HHapys90l ) + AWsVS001 
IF ( COR ,GT. ERR ) ICONV = 1 

V¥S001 = VS00Me* COR 


¢ AWeVS1Q1 


SHR arose SUPPLEMENTARY STATE VARTABLES HOURS 


VSO1 = TR418YO01 +» OMTTH601LEVS*OL 
IF ( VSO} .GT. VSNOL) VSOL = VSH01 
IF ( VSO1 ,~LTe“VSM01) ¥SOl =-VSNOl 
XiG) = PIAL -TIKOIeWoOl 
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Ze = PPh) OPP ON) 


lee (mee) GUT. 0.0 3) %201 = O20 
, IF © X201 GT. PMXO) ® X201 = PMXO! 
C sen eed TURBINE / GOVERNOR STATE VARIABLES ' ooaune 
C : 
DPLOL ==THITIOL@ICKIO1 +X101 ) 
COR = W#®(PIOIP * HHenPl1O)l ) « AW#P101 
We @ COW CGT. «FRR ) ICONV = } 
Pi01 = P191 + COR 
OP201 = T3T01%(xX101-P201) 
COR = We (P201P + HHENP2ZD) ) « AWtP20} 
Ir ( COR .GT. EPR ) ICONV = 1 
Peuyl = P20} «+ COR 
DP401 = TKTO1#(OMK2018#X201-P4r.) 
COR = wW%(P401P + HHEeNP4SO] ) AWeP40] 
me. ( GOR GT. ERR ) WeONVes 2 
P401 = P4001 « COR 
C : 
Cc FEAL TURBINE 7 GOVERNOR STATE VARIABLES eeiows 
c 
PHOl = P4éa} #CK2018X%701] 
c 
a H$HSLHASD | NETWORK EQUATIONS : HAV GRY SG 
C 
Cc 
C ReHeoos NROW STORES THE NUMRER OF NONZERO ELEMENTS OF EACH ROW vesoroe 
C 
oo athlete Iv FHEN POPNTS TO THE APPROTATE MOUE VOLTAGE FOR EACH ®#8tso 
C seen MONZERO ENTRY IN Y a0 Cake 
— 
c 
SHHKUHS THE NETWORK EQUATION® ARE SOLVED RY BLOCKS HOHHOHES 
ee © Tee =F TRSt BLOCK 1S THE RLOCK LAST ROW FOR THE INTER TOW 
OHH RDYD RESULTS THAT WILL ®E SUMMED BY THE LAST PROCESSOR OH ODNTOOS 
C 


De vos J = ITUN,IBE 
Sigs (0,,0.) = 

Por = BQOW (jie oo 
DO 110 L = loewWR 

Jae ae NiCd) 

110 SUM = SUM * Y¥(JeL) 8V (Uy) i om im : 
eS TVldel) = SUM 


[- *) Cw (7, Jer 


eS - awe: 
bir ( ROW C s) eit. 0 ) (TS ey 





Coe seo LAST BLOCK PARTIALLY SOLVED BY EACH PROCESSOR SHUaa 

c 

C 

GLH HGHED TRE PROCESSOR ASSIGSED To THE LAST BLOCK MUST SUM HOOBOHDIS 
HYeOIWeES ALL INTERAMTDATE RESULTS FROM ALL PROCESSORS TO ee 
came oe FIND THE ITERATION PESULTS FOR THE CUT SET VARTARLES ale 
C 

ere te SYNCHRONIZATION REQUIRED HERE 

C a 4 Bie 
aovuoged THE NEXT SECTION OF CODE APPLIES ONLY TO THE PROCESSOR senavere 


Ben crninite © ASSIGNED To THE CUT SET VARIABLES seGHGaLLIE 
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URROUTINE GUASS NORMAL CNC 6600 FIN V3B.0-193396 OPT=?2? A7/21/7% 
: c 
DO 200 J = IRN , IE 
NR = NOR(4) 
00 205 L = 2,NR 
ie = IVC) 
205 SUM = SUM « Y(JoL) ®VCLL) 
NR = NROW (J) 
0O 210 tL = 1-NR 
LL = IVP (.j-ITRNoL) 
SUM = SUM # TV(JeLL) 
210 TVJelLL) = MINF 
SUM = PC(T)®Y( Tol) /SCOMNUG(VII)) = SUM 
COA = WHSUMAAN FV (CT) 
IF( CABS (COX) »GT. ERR ) ICONV = 1 
200 V(U) = Vg) «¢ COX 
SHono AS END OF CODE FXCLUSIVELY FOR PROCESSSOR ASSIGNED TO CUT SET evan 
00 300 J = I85 , IRI 
SUM = (9,,0e) 
NR = NROW (J) 
00 305 L = 2eNR 
Jy = Ivtu,l) 
305 SUM = SUN #© YVYldel) Vv (yy) 7 
mae Ls («Ge LOEN ) GOTO 311 
CJ) = (Stim « ONE) SY (Jel) © PLO) SCONIGIVGJ)) 
31] SUM = PCT)®AYCT®1)/COMNUIGIVC(I)) = SUM 
COX = WeStim-AVeEVOT) 
Tree CarotcOx) 261, FRR ' ICONV = } 
300 Yau) = J(j) + COX 
Cc 
enooas SYNCHRONIZATION REQUIRED HERE 
Cc voRonD NETWORK EQiJATIONS _ vooaseo 
\Y 
KNT = KNT # } : 
IF ( ICONY .NE- 0 ) GOTO 50 
C 
‘a eer ADVANCE TIME STEP sea veo 
: THPQ = THo]l 
THOL = THAL © 2.%HH#NTHO) 
THOIP = TPO * HH*®OTHO) 
THRO = NO] 
WO} = W0L + 2e*HHYOWO) 
WOlP = TyHo0Qg + HHeDWwd)] 
THPO = £091 
FQ0L = EQ41 «© 2.?HH*NEQOL 
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